Re: [PATCH v6] Documentation/gpu: VM_BIND locking document

2023-11-29 Thread John Hubbard

On 11/29/23 01:06, Thomas Hellström wrote:

Add the first version of the VM_BIND locking document which is
intended to be part of the xe driver upstreaming agreement.

The document describes and discusses the locking used during exec-
functions, eviction and for userptr gpu-vmas. The intention is to use the
same nomenclature as the drm-vm-bind-async.rst.



Hi Thomas,

As requested, for the pin_user_pages() aspects (the MMU notifier
registration case), please feel free to add:

Acked-by: John Hubbard 
v2:
- s/gvm/gpu_vm/g (Rodrigo Vivi)
- Clarify the userptr seqlock with a pointer to mm/mmu_notifier.c
   (Rodrigo Vivi)
- Adjust commit message accordingly.
- Add SPDX license header.

v3:
- Large update to align with the drm_gpuvm manager locking
- Add "Efficient userptr gpu_vma exec function iteration" section
- Add "Locking at bind- and unbind time" section.

v4:
- Fix tabs vs space errors by untabifying (Rodrigo Vivi)
- Minor style fixes and typos (Rodrigo Vivi)
- Clarify situations where stale GPU mappings are occurring and how
   access through these mappings are blocked. (Rodrigo Vivi)
- Insert into the toctree in implementation_guidelines.rst

v5:
- Add a section about recoverable page-faults.
- Use local references to other documentation where possible
   (Bagas Sanjaya)
- General documentation fixes and typos (Danilo Krummrich and
   Boris Brezillon)
- Improve the documentation around locks that need to be grabbed from the
   dma-fence critical section (Boris Brezillon)
- Add more references to the DRM GPUVM helpers (Danilo Krummrich and
   Boris Brezillon)
- Update the rfc/xe.rst document.

v6:
- Rework wording to improve readability (Boris Brezillon, Rodrigo Vivi,
   Bagas Sanjaya)
- Various minor fixes across the document (Boris Brezillon)

Cc: Rodrigo Vivi 
Signed-off-by: Thomas Hellström 
Reviewed-by: Boris Brezillon 
Reviewed-by: Rodrigo Vivi 
Reviewed-by: Danilo Krummrich 
---
  Documentation/core-api/pin_user_pages.rst |   2 +
  Documentation/gpu/drm-mm.rst  |   4 +
  Documentation/gpu/drm-vm-bind-locking.rst | 582 ++
  .../gpu/implementation_guidelines.rst |   1 +
  Documentation/gpu/rfc/xe.rst  |   5 +
  5 files changed, 594 insertions(+)
  create mode 100644 Documentation/gpu/drm-vm-bind-locking.rst

diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index d3c1f6d8c0e0..6b5f7e6e7155 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -153,6 +153,8 @@ NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
  because DAX pages do not have a separate page cache, and so "pinning" implies
  locking down file system blocks, which is not (yet) supported in that way.
  
+.. _mmu-notifier-registration-case:

+
  CASE 3: MMU notifier registration, with or without page faulting hardware
  --------------------------------------------------------------------------
  Device drivers can pin pages via get_user_pages*(), and register for mmu
diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
index acc5901ac840..d55751cad67c 100644
--- a/Documentation/gpu/drm-mm.rst
+++ b/Documentation/gpu/drm-mm.rst
@@ -466,6 +466,8 @@ DRM MM Range Allocator Function References
  .. kernel-doc:: drivers/gpu/drm/drm_mm.c
 :export:
  
+.. _drm_gpuvm:

+
  DRM GPUVM
   =========
  
@@ -481,6 +483,8 @@ Split and Merge

  .. kernel-doc:: drivers/gpu/drm/drm_gpuvm.c
 :doc: Split and Merge
  
+.. _drm_gpuvm_locking:

+
  Locking
   -------
  
diff --git a/Documentation/gpu/drm-vm-bind-locking.rst b/Documentation/gpu/drm-vm-bind-locking.rst

new file mode 100644
index ..a345aa513d12
--- /dev/null
+++ b/Documentation/gpu/drm-vm-bind-locking.rst
@@ -0,0 +1,582 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+
+===============
+VM_BIND locking
+===============
+
+This document attempts to describe what's needed to get VM_BIND locking right,
+including the userptr mmu_notifier locking. It also discusses some
+optimizations to get rid of the looping through of all userptr mappings and
+external / shared object mappings that is needed in the simplest
+implementation. In addition, there is a section describing the VM_BIND locking
+required for implementing recoverable pagefaults.
+
+The DRM GPUVM set of helpers
+============================
+
+There is a set of helpers for drivers implementing VM_BIND, and this
+set of helpers implements much, but not all of the locking described
+in this document. In particular, it is currently lacking a userptr
+implementation. This document does not intend to describe the DRM GPUVM
+implementation in detail, but it is covered in :ref:`its own
+documentation <drm_gpuvm>`. It is highly recommended for any driver
+implementing VM_BIND to use the DRM GPUVM helpers and to extend it if
+common functionality is missing.
+
+Nomenclature
+============

Re: [PATCH v5 1/6] mm/gup: remove unused vmas parameter from get_user_pages()

2023-05-16 Thread John Hubbard
On 5/16/23 07:35, David Hildenbrand wrote:
...
>>> When passing NULL as "pages" to get_user_pages(), __get_user_pages_locked()
>>> won't set FOLL_GET. As FOLL_PIN is also not set, we won't be messing with
>>> the mapcount of the page.
> 
> For completeness: s/mapcount/refcount/ :)

whew, you had me going there! Now it all adds up. :) 

thanks,
-- 
John Hubbard
NVIDIA



Re: [PATCH mm-unstable v1 09/20] mm/gup: reliable R/O long-term pinning in COW mappings

2022-11-23 Thread John Hubbard

On 11/16/22 02:26, David Hildenbrand wrote:
...

With this change, the new R/O long-term pinning tests for non-anonymous
memory succeed:
   # [RUN] R/O longterm GUP pin ... with shared zeropage
   ok 151 Longterm R/O pin is reliable
   # [RUN] R/O longterm GUP pin ... with memfd
   ok 152 Longterm R/O pin is reliable
   # [RUN] R/O longterm GUP pin ... with tmpfile
   ok 153 Longterm R/O pin is reliable
   # [RUN] R/O longterm GUP pin ... with huge zeropage
   ok 154 Longterm R/O pin is reliable
   # [RUN] R/O longterm GUP pin ... with memfd hugetlb (2048 kB)
   ok 155 Longterm R/O pin is reliable
   # [RUN] R/O longterm GUP pin ... with memfd hugetlb (1048576 kB)
   ok 156 Longterm R/O pin is reliable
   # [RUN] R/O longterm GUP-fast pin ... with shared zeropage
   ok 157 Longterm R/O pin is reliable
   # [RUN] R/O longterm GUP-fast pin ... with memfd
   ok 158 Longterm R/O pin is reliable
   # [RUN] R/O longterm GUP-fast pin ... with tmpfile
   ok 159 Longterm R/O pin is reliable
   # [RUN] R/O longterm GUP-fast pin ... with huge zeropage
   ok 160 Longterm R/O pin is reliable
   # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (2048 kB)
   ok 161 Longterm R/O pin is reliable
   # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (1048576 kB)
   ok 162 Longterm R/O pin is reliable


Yes. I was able to reproduce these results, after some minor distractions
involving huge pages, don't ask. :)



Note 1: We don't care about short-term R/O-pinning, because they have
snapshot semantics: they are not supposed to observe modifications that
happen after pinning.

As one example, assume we start direct I/O to read from a page and store
page content into a file: modifications to page content after starting
direct I/O are not guaranteed to end up in the file. So even if we'd pin
the shared zeropage, the end result would be as expected -- getting zeroes
stored to the file.
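
As a concrete sketch of those snapshot semantics (hypothetical driver
code, not part of this series -- the helper name is made up):

	#include <linux/highmem.h>
	#include <linux/minmax.h>
	#include <linux/mm.h>
	#include <linux/string.h>

	/*
	 * Short-term R/O pin: no FOLL_LONGTERM, so even the shared zeropage
	 * may be pinned. CPU writes that land after the pin are not
	 * guaranteed to be observed by the copy below -- which is fine,
	 * because a snapshot is all the caller asked for.
	 */
	static int snapshot_read_user(unsigned long uaddr, void *buf, size_t len)
	{
		struct page *page;
		void *kaddr;
		int ret;

		ret = pin_user_pages_fast(uaddr, 1, 0, &page);
		if (ret != 1)
			return ret < 0 ? ret : -EFAULT;

		kaddr = kmap_local_page(page);
		memcpy(buf, kaddr, min_t(size_t, len, PAGE_SIZE));
		kunmap_local(kaddr);

		unpin_user_page(page);
		return 0;
	}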

Note 2: For shared mappings we'll now always fallback to the slow path to
lookup the VMA when R/O long-term pinning. While that's the necessary price
we have to pay right now, it's actually not that bad in practice: most
FOLL_LONGTERM users already specify FOLL_WRITE, for example, along with
FOLL_FORCE because they tried dealing with COW mappings correctly ...

Note 3: For users that use FOLL_LONGTERM right now without FOLL_WRITE,
such as VFIO, we'd now no longer pin the shared zeropage. Instead, we'd
populate exclusive anon pages that we can pin. There was a concern that
this could affect the memlock limit of existing setups.

For example, a VM running with VFIO could run into the memlock limit and
fail to run. However, we essentially had the same behavior already in
commit 17839856fd58 ("gup: document and work around "COW can break either
way" issue") which got merged into some enterprise distros, and there were
not any such complaints. So most probably, we're fine.

Signed-off-by: David Hildenbrand 
---
  include/linux/mm.h | 27 ---
  mm/gup.c   | 10 +-
  mm/huge_memory.c   |  2 +-
  mm/hugetlb.c   |  7 ---
  4 files changed, 34 insertions(+), 12 deletions(-)



Looks good,

Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA


diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6bd2ee5872dd..e8cc838f42f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3095,8 +3095,12 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
   * Must be called with the (sub)page that's actually referenced via the
   * page table entry, which might not necessarily be the head page for a
   * PTE-mapped THP.
+ *
+ * If the vma is NULL, we're coming from the GUP-fast path and might have
+ * to fallback to the slow path just to lookup the vma.
   */
-static inline bool gup_must_unshare(unsigned int flags, struct page *page)
+static inline bool gup_must_unshare(struct vm_area_struct *vma,
+   unsigned int flags, struct page *page)
  {
/*
 * FOLL_WRITE is implicitly handled correctly as the page table entry
@@ -3109,8 +3113,25 @@ static inline bool gup_must_unshare(unsigned int flags, struct page *page)
 * Note: PageAnon(page) is stable until the page is actually getting
 * freed.
 */
-   if (!PageAnon(page))
-   return false;
+   if (!PageAnon(page)) {
+   /*
+* We only care about R/O long-term pinning: R/O short-term
+* pinning does not have the semantics to observe successive
+* changes through the process page tables.
+*/
+   if (!(flags & FOLL_LONGTERM))
+   return false;
+
+   /* We really need the vma ... */
+   if (!vma)
+   return true;
+
+   /*
+* ... because we only care about writable private ("COW")
+* mapping

Re: [RFC PATCH 0/3] new subsystem for compute accelerator devices

2022-10-24 Thread John Hubbard

On 10/24/22 05:43, Oded Gabbay wrote:

Hi Oded,

The patches make sense to me. I'm still just reading through and looking
for minor issues, but at a high level it seems to match what the LPC
discussions pointed to.


What's your opinion on the long-term prospect of DRM vs accel? I assume
that over time, DRM helpers will move into accel and some DRM drivers
will start depending on accel?

I don't think that is what I had in mind.
What I had in mind is that accel helpers are only relevant for accel
drivers, and any code that might also be relevant for DRM drivers will
be placed in DRM core code. e.g. GEM enhancements, RAS netlink


Yes. That is how I understood it ("it" being both the LPC discussions,
and this patchset) as well:

* accel-only code goes in drivers/accel, thus allowing for
  smaller, simpler drivers (as compared to full drm) for that case.

* graphics and display code still goes in drivers/gpu/drm, because
  it is much too hard to rename or move that directory.

* code common to both also goes in drivers/gpu/drm.

Looking ahead a bit more:

For full-featured GPUs that do both Graphics and Compute, I expect
that a *lot* of the code will end up in drivers/gpu/drm. Because so
much of setting up for Compute is also really just setting up for
Graphics--that's how it evolved, after all!

And as things are structured now, it looks like those full featured
GPU stacks will also need an aux bus (which I only just now learned
about, but it looks quite helpful here). And also, user space will
need to open both /dev/dri/* and /dev/accel/* nodes, if it needs
access to any live objects that drivers/accel owns.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 6/7] nouveau/dmem: Evict device private memory during release

2022-09-26 Thread John Hubbard
On 9/26/22 14:35, Lyude Paul wrote:
>> +for (i = 0; i < npages; i++) {
>> +if (src_pfns[i] & MIGRATE_PFN_MIGRATE) {
>> +struct page *dpage;
>> +
>> +/*
>> + * __GFP_NOFAIL because the GPU is going away and there
>> + * is nothing sensible we can do if we can't copy the
>> + * data back.
>> + */
> 
> You'll have to excuse me for a moment since this area of nouveau isn't one of
> my strongpoints, but are we sure about this? IIRC __GFP_NOFAIL means infinite
> retry, in the case of a GPU hotplug event I would assume we would rather just
> stop trying to migrate things to the GPU and just drop the data instead of
> hanging on infinite retries.
> 
Hi Lyude!

Actually, I really think it's better in this case to keep trying
(presumably not necessarily infinitely, but only until memory becomes
available), rather than failing out and corrupting data.

That's because I'm not sure it's completely clear that this memory is
discardable. And at some point, we're going to make this all work with
file-backed memory, which will *definitely* not be discardable--I
realize that we're not there yet, of course.

But here, it's reasonable to commit to just retrying indefinitely,
really. Memory should eventually show up. And if it doesn't, then
restarting the machine is better than corrupting data, generally.
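
To sketch the "retry until memory becomes available" middle ground
(illustrative only; the bound below is made up and not part of the
patch):

	#define MAX_EVICT_ALLOC_TRIES	1024	/* made-up bound */

	struct page *dpage = NULL;
	int tries;

	for (tries = 0; tries < MAX_EVICT_ALLOC_TRIES; tries++) {
		dpage = alloc_page(GFP_HIGHUSER);
		if (dpage)
			break;
		cond_resched();	/* let reclaim make progress, then retry */
	}
	if (!dpage)
		return -ENOMEM;	/* caller must then accept losing the data */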


thanks,

-- 
John Hubbard
NVIDIA



Re: [PATCH v2 1/3] migrate.c: Remove vma check in migrate_vma_setup()

2022-02-15 Thread John Hubbard

On 2/6/22 20:26, Alistair Popple wrote:

migrate_vma_setup() checks that a valid vma is passed so that the page
tables can be walked to find the pfns associated with a given address
range. However in some cases the pfns are already known, such as when
migrating device coherent pages during pin_user_pages() meaning a valid
vma isn't required.

Signed-off-by: Alistair Popple 
Acked-by: Felix Kuehling 
---

Changes for v2:

  - Added Felix's Acked-by

  mm/migrate.c | 34 +-
  1 file changed, 17 insertions(+), 17 deletions(-)



Hi Alistair,

Another late-breaking review question, below. :)


diff --git a/mm/migrate.c b/mm/migrate.c
index a9aed12..0d6570d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2602,24 +2602,24 @@ int migrate_vma_setup(struct migrate_vma *args)
  
  	args->start &= PAGE_MASK;
  	args->end &= PAGE_MASK;
-   if (!args->vma || is_vm_hugetlb_page(args->vma) ||
-   (args->vma->vm_flags & VM_SPECIAL) || vma_is_dax(args->vma))
-   return -EINVAL;
-   if (nr_pages <= 0)
-   return -EINVAL;


Was the above check left out intentionally? If so, then it needs a
commit description note. And maybe even a separate patch, because it
changes the behavior.

If you do want to change the behavior:

* The kerneldoc comment above this function supports such a change: it
requires returning 0 for the case of zero pages requested. So your
change would bring the comments into alignment with the code.

* I don't think memset deals properly with a zero length input arg, so
it's probably better to return 0, before that point.


thanks,
--
John Hubbard
NVIDIA


-   if (args->start < args->vma->vm_start ||
-   args->start >= args->vma->vm_end)
-   return -EINVAL;
-   if (args->end <= args->vma->vm_start || args->end > args->vma->vm_end)
-   return -EINVAL;
if (!args->src || !args->dst)
return -EINVAL;
-
-   memset(args->src, 0, sizeof(*args->src) * nr_pages);
-   args->cpages = 0;
-   args->npages = 0;
-
-   migrate_vma_collect(args);
+   if (args->vma) {
+   if (is_vm_hugetlb_page(args->vma) ||
+   (args->vma->vm_flags & VM_SPECIAL) || vma_is_dax(args->vma))
+   return -EINVAL;
+   if (args->start < args->vma->vm_start ||
+   args->start >= args->vma->vm_end)
+   return -EINVAL;
+   if (args->end <= args->vma->vm_start || args->end > args->vma->vm_end)
+   return -EINVAL;
+
+   memset(args->src, 0, sizeof(*args->src) * nr_pages);
+   args->cpages = 0;
+   args->npages = 0;
+
+   migrate_vma_collect(args);
+   }
  
  	if (args->cpages)
  		migrate_vma_unmap(args);
@@ -2804,7 +2804,7 @@ void migrate_vma_pages(struct migrate_vma *migrate)
continue;
}
  
-		if (!page) {
+   if (!page && migrate->vma) {
if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
continue;
if (!notified) {
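
For readers wondering what a vma-less caller looks like: roughly the
following, going by the companion patch in this series (sketch only;
note the assumption that the caller fills in cpages/npages itself,
since migrate_vma_collect() is skipped when vma == NULL):

	unsigned long src_pfn, dst_pfn = 0;
	struct migrate_vma args = {
		.vma	= NULL,		/* no CPU mapping to walk */
		.src	= &src_pfn,	/* pfn already known to the driver */
		.dst	= &dst_pfn,
		.cpages	= 1,		/* assumption: caller accounts this */
		.npages	= 1,
	};

	src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
	if (migrate_vma_setup(&args))
		return -EBUSY;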




Re: [PATCH v2 2/3] mm/gup.c: Migrate device coherent pages when pinning instead of failing

2022-02-11 Thread John Hubbard

On 2/11/22 18:51, Alistair Popple wrote:
...

@@ -1888,15 +1942,40 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
continue;
prev_head = head;
/*
-* If we get a movable page, since we are going to be pinning
-* these entries, try to move them out if possible.
+* Device coherent pages are managed by a driver and should not
+* be pinned indefinitely as it prevents the driver moving the
+* page. So when trying to pin with FOLL_LONGTERM instead try
+* migrating page out of device memory.
 */
if (is_dev_private_or_coherent_page(head)) {
+   /*
+* device private pages will get faulted in during gup
+* so it shouldn't be possible to see one here.
+*/
WARN_ON_ONCE(is_device_private_page(head));
-   ret = -EFAULT;
-   goto unpin_pages;
+   WARN_ON_ONCE(PageCompound(head));
+
+   /*
+* migration will fail if the page is pinned, so convert
+* the pin on the source page to a normal reference.
+*/
+   if (gup_flags & FOLL_PIN) {
+   get_page(head);
+   unpin_user_page(head);


OK...but now gup_flags can no longer be used as a guide for how to
release these pages, right? In other words, up until this point,
FOLL_PIN meant "call unpin_user_page() in order to release". However,
now this page must be released via put_page().


This is the source page (head). We are unpinning it because we can't migrate a
pinned page, however we still need a reference on it for migrate_vma hence the
get_page followed by unpin. In the non-FOLL_PIN case we already have a
reference from gup.


See below...


+   }
+
+   pages[i] = migrate_device_page(head, gup_flags);


migrate_device_page() will return a new page that has been correctly pinned
with gup_flags by try_grab_page(). Therefore this page can still be released
with unpin_user_page() or put_page() as appropriate for the given gup_flags.

The reference we had on the source page (head) always gets dropped in
migrate_vma_finalize().


OK. Good.

The above would be good to have in a comment, right around here, imho.
Because we have this marvelous mix of references for migration (get_page())
and other, and it's a bit hard to see that it's all correct without a
hint or two.
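
Concretely, the pairing rule such a comment should capture is (sketch;
the helper name is made up):

	static void release_gup_page(struct page *page, unsigned int gup_flags)
	{
		if (gup_flags & FOLL_PIN)
			unpin_user_page(page);	/* pairs with pin_user_pages*() */
		else
			put_page(page);		/* pairs with get_user_pages*() */
	}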

...


Which unless I've missed something is still the correct thing to do.


This reminds me: out of the many things to monitor, the FOLL_PIN counts
in /proc/vmstat are especially helpful, whenever making changes to code
that deals with this:

nr_foll_pin_acquired
nr_foll_pin_released

...and those should normally be equal to each other when "at rest".



I hope this is/was run, just to be sure?
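
A quick way to eyeball those two counters is a tiny userspace reader
along these lines (sketch):

	#include <stdio.h>

	int main(void)
	{
		char line[128];
		long acquired = -1, released = -1;
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f)
			return 1;
		while (fgets(line, sizeof(line), f)) {
			sscanf(line, "nr_foll_pin_acquired %ld", &acquired);
			sscanf(line, "nr_foll_pin_released %ld", &released);
		}
		fclose(f);
		printf("acquired=%ld released=%ld delta=%ld\n",
		       acquired, released, acquired - released);
		return 0;
	}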


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v2 2/3] mm/gup.c: Migrate device coherent pages when pinning instead of failing

2022-02-11 Thread John Hubbard
pages[i] = migrate_device_page(head, gup_flags);
+   if (!pages[i]) {
+   ret = -EBUSY;
+   break;
+   }
+   continue;
}
  
+		/*

+* If we get a movable page, since we are going to be pinning
+* these entries, try to move them out if possible.
+*/
if (!is_pinnable_page(head)) {
if (PageHuge(head)) {
if (!isolate_huge_page(head, &movable_page_list))
@@ -1924,16 +2003,22 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
 * If list is empty, and no isolation errors, means that all pages are
 * in the correct zone.
 */
-   if (list_empty(&movable_page_list) && !isolation_error_count)
+   if (!ret && list_empty(&movable_page_list) && !isolation_error_count)
return nr_pages;
  
-unpin_pages:

-   if (gup_flags & FOLL_PIN) {
-   unpin_user_pages(pages, nr_pages);
-   } else {
-   for (i = 0; i < nr_pages; i++)
+   for (i = 0; i < nr_pages; i++)
+   if (!pages[i])
+   continue;
+   else if (gup_flags & FOLL_PIN)
+   unpin_user_page(pages[i]);


...and here, for example, we are still trusting gup_flags to decide how
to release the pages.

This reminds me: out of the many things to monitor, the FOLL_PIN counts
in /proc/vmstat are especially helpful, whenever making changes to code
that deals with this:

nr_foll_pin_acquired
nr_foll_pin_released

...and those should normally be equal to each other when "at rest".


thanks,
--
John Hubbard
NVIDIA


+   else
put_page(pages[i]);
+
+   if (ret && !list_empty(&movable_page_list)) {
+   putback_movable_pages(&movable_page_list);
+   return ret;
}
+
if (!list_empty(&movable_page_list)) {
ret = migrate_pages(&movable_page_list, alloc_migration_target,
NULL, (unsigned long)&mtc, MIGRATE_SYNC,




Re: Phyr Starter

2022-01-20 Thread John Hubbard

On 1/20/22 6:12 AM, Christoph Hellwig wrote:

On Tue, Jan 11, 2022 at 12:17:18AM -0800, John Hubbard wrote:

Zooming in on the pinning aspect for a moment: last time I attempted to
convert O_DIRECT callers from gup to pup, I recall wanting very much to
record, in each bio_vec, whether these pages were acquired via FOLL_PIN,
or some non-FOLL_PIN method. Because at the end of the IO, it is not
easy to disentangle which pages require put_page() and which require
unpin_user_page*().


I don't think that is a problem.  Pinning only need to happen for
ITER_IOVEC, and the only non-user pages there is the ZERO_PAGE added
for padding that can be special cased.


I am really glad to hear you say that. Because I just worked through it
again in detail yesterday (including your and others' old emails about
this), and tentatively reached the same conclusion from seeing the call 
paths. But I wanted to confirm with someone who actually knows this code 
well, and that's not me. :)


Things like dio_refill_pages() are mixing in the zero page, but like you 
say, that can be handled. I have a few ideas for that.


Now that the goal is considerably narrower as compared to 2019
("convert DIO callers to pup", instead of "convert the world to pup", 
ha), this looks quite feasible after all.



thanks,
--
John Hubbard
NVIDIA


Re: Phyr Starter

2022-01-11 Thread John Hubbard

On 1/10/22 11:34, Matthew Wilcox wrote:

TLDR: I want to introduce a new data type:

struct phyr {
 phys_addr_t addr;
 size_t len;
};

and use it to replace bio_vec as well as using it to replace the array
of struct pages used by get_user_pages() and friends.

---


This would certainly solve quite a few problems at once. Very compelling.

Zooming in on the pinning aspect for a moment: last time I attempted to
convert O_DIRECT callers from gup to pup, I recall wanting very much to
record, in each bio_vec, whether these pages were acquired via FOLL_PIN,
or some non-FOLL_PIN method. Because at the end of the IO, it is not
easy to disentangle which pages require put_page() and which require
unpin_user_page*().

And changing the bio_vec for *that* purpose was not really acceptable.

But now that you're looking to change it in a big way (and with some
spare bits available...oohh!), maybe I can go that direction after all.

Or, are you looking at a design in which any phyr is implicitly FOLL_PIN'd
if it exists at all?

Or any other thoughts in this area are very welcome.



There are two distinct problems I want to address: doing I/O to memory
which does not have a struct page and efficiently doing I/O to large
blobs of physically contiguous memory, regardless of whether it has a
struct page.  There are some other improvements which I regard as minor.

There are many types of memory that one might want to do I/O to that do
not have a struct page, some examples:
  - Memory on a graphics card (or other PCI card, but gfx seems to be
the primary provider of DRAM on the PCI bus today)
  - DAX, or other pmem (there are some fake pages today, but this is
mostly a workaround for the IO problem today)
  - Guest memory being accessed from the hypervisor (KVM needs to
create structpages to make this happen.  Xen doesn't ...)
All of these kinds of memories can be addressed by the CPU and so also
by a bus master.  That is, there is a physical address that the CPU
can use which will address this memory, and there is a way to convert
that to a DMA address which can be programmed into another device.
There's no intent here to support memory which can be accessed by a
complex scheme like writing an address to a control register and then
accessing the memory through a FIFO; this is for memory which can be
accessed by DMA and CPU loads and stores.

For get_user_pages() and friends, we currently fill an array of struct
pages, each one representing PAGE_SIZE bytes.  For an application that
is using 1GB hugepages, writing 2^18 entries is a significant overhead.
It also makes drivers hard to write as they have to recoalesce the
struct pages, even though the VM can tell it whether those 2^18 pages
are contiguous.
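
To put numbers on that, here is a sketch of the two representations for
a single 1GB hugepage (struct phyr quoted from above; the physical
address is made up):

	/* Today: get_user_pages() fills 2^18 page pointers for 1GB. */
	static struct page *pages[1UL << 18];

	/* Proposed: one phyr covers the whole contiguous range. */
	struct phyr {
		phys_addr_t addr;
		size_t len;
	};

	static const struct phyr huge_range = {
		.addr	= 0x100000000ULL,	/* hypothetical address */
		.len	= 1UL << 30,		/* 1GB */
	};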

On the minor side, struct phyr can represent any mappable chunk of memory.
A bio_vec is limited to 2^32 bytes, while on 64-bit machines a phyr
can represent larger than 4GB.  A phyr is the same size as a bio_vec
on 64 bit (16 bytes), and the same size for 32-bit with PAE (12 bytes).
It is smaller for 32-bit machines without PAE (8 bytes instead of 12).

Finally, it may be possible to stop using scatterlist to describe the
input to the DMA-mapping operation.  We may be able to get struct
scatterlist down to just dma_address and dma_length, with chaining
handled through an enclosing struct.
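
That slimmed-down scatterlist would be something like (purely
hypothetical; no such struct exists today):

	struct slim_sg {
		dma_addr_t	dma_address;
		unsigned int	dma_length;
		/* chaining handled by an enclosing struct, not per-entry */
	};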

I would like to see phyr replace bio_vec everywhere it's currently used.
I don't have time to do that work now because I'm busy with folios.
If someone else wants to take that on, I shall cheer from the sidelines.


I'm starting to wonder if I should jump in here, in order to get this
as a way to make the O_DIRECT conversion much cleaner. But let's see.


What I do intend to do is:

  - Add an interface to gup.c to pin/unpin N phyrs
  - Add a sg_map_phyrs()
This will take an array of phyrs and allocate an sg for them
  - Whatever else I need to do to make one RDMA driver happy with
this scheme

At that point, I intend to stop and let others more familiar with this
area of the kernel continue the conversion of drivers.

P.S. If you've had the Prodigy song running through your head the whole
time you've been reading this email ... I'm sorry / You're welcome.
If people insist, we can rename this to phys_range or something boring,
but I quite like the spelling of phyr with the pronunciation of "fire".


A more conservative or traditional name might look like:

phys_vec (maintains some resemblance to what it's replacing)
phys_range
phys_addr_range

phyr is rather cool, but it's also awfully close to "phys" for
reading comfort. And there is a lot to be said for self-descriptive names,
which phyr is not.

And of course, you're signing up for another huge naming debate with Linus
if you go with the "cool" name here. :)


thanks,
--
John Hubbard
NVIDIA


Re: [RFC] Make use of non-dynamic dmabuf in RDMA

2021-08-25 Thread John Hubbard

On 8/24/21 11:17 PM, Christian König wrote:
...

I think it depends on the user, if the user creates memory which is
permanently located on the GPU then it should be pinnable in this way
without force migration. But if the memory is inherently migratable
then it just cannot be pinned in the GPU at all as we can't
indefinitely block migration from happening, e.g. if the CPU touches it
later or something.


Yes, exactly that's the point. Especially GPUs have a great variety of setups.

For example we have APUs where the local memory is just stolen system memory and all buffers must be 
migrate-able because you might need all of this stolen memory for scanout or page tables. In this 
case P2P only makes sense to avoid the migration overhead in the first place.


Then you got dGPUs where only a fraction of the VRAM is accessible from the PCIe BUS. Here you also 
absolutely don't want to pin any buffers because that can easily crash when we need to migrate 
something into the visible window for CPU access.


The only real option where you could do P2P with buffer pinning are those compute boards where we 
know that everything is always accessible to everybody and we will never need to migrate anything. 
But even then you want some mechanism like cgroups to take care of limiting this. Otherwise any 
runaway process can bring down your whole system.


Key question at least for me as GPU maintainer is if we are going to see modern compute boards 
together with old non-ODP setups. Since those compute boards are usually used with new hardware 
(like PCIe v4 for example) the answer I think is most likely "no".




That is a really good point. Times have changed and I guess ODP is on most 
(all?) of
the new Infiniband products now, and maybe we don't need to worry so much about
providing first-class support for non-ODP setups.

I've got to drag my brain into 2021+! :)

thanks,
--
John Hubbard
NVIDIA


Re: [RFC] Make use of non-dynamic dmabuf in RDMA

2021-08-24 Thread John Hubbard

On 8/24/21 10:32 AM, Jason Gunthorpe wrote:
...

And yes at least for the amdgpu driver we migrate the memory to host
memory as soon as it is pinned and I would expect that other GPU drivers
do something similar.


Well...for many topologies, migrating to host memory will result in a
dramatically slower p2p setup. For that reason, some GPU drivers may
want to allow pinning of video memory in some situations.

Ideally, you've got modern ODP devices and you don't even need to pin.
But if not, and you still hope to do high performance p2p between a GPU
and a non-ODP Infiniband device, then you would need to leave the pinned
memory in vidmem.

So I think we don't want to rule out that behavior, right? Or is the
thinking more like, "you're lucky that this old non-ODP setup works at
all, and we'll make it work by routing through host/cpu memory, but it
will be slow"?


I think it depends on the user, if the user creates memory which is
permanently located on the GPU then it should be pinnable in this way
without force migration. But if the memory is inherently migratable
then it just cannot be pinned in the GPU at all as we can't
indefinately block migration from happening eg if the CPU touches it
later or something.



OK. I just want to avoid creating any API-level assumptions that dma_buf_pin()
necessarily implies or requires migrating to host memory.
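
A sketch of what that flexibility looks like on the exporter side
(illustrative only, not any driver's actual code; the my_* names are
made up):

	/* Hypothetical exporter .pin() callback: placement is policy. */
	static int my_exporter_pin(struct dma_buf_attachment *attach)
	{
		struct my_bo *bo = attach->dmabuf->priv;

		if (bo->always_vram_resident)
			return my_bo_pin_in_vram(bo);	/* keep the P2P fast path */

		return my_bo_migrate_and_pin_in_sysmem(bo);
	}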

thanks,
--
John Hubbard
NVIDIA


Re: [RFC] Make use of non-dynamic dmabuf in RDMA

2021-08-24 Thread John Hubbard

On 8/24/21 2:32 AM, Christian König wrote:

Am 24.08.21 um 11:06 schrieb Gal Pressman:

On 23/08/2021 13:43, Christian König wrote:

Am 21.08.21 um 11:16 schrieb Gal Pressman:

On 20/08/2021 17:32, Jason Gunthorpe wrote:

On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:

...

IIUC, we're talking about three different exporter "types":
- Dynamic with move_notify (requires ODP)
- Dynamic with revoke_notify
- Static

Which changes do we need to make the third one work?

Basically none at all in the framework.

You just need to properly use the dma_buf_pin() function when you start using a
buffer (e.g. before you create an attachment) and the dma_buf_unpin() function
after you are done with the DMA-buf.

I replied to your previous mail, but I'll ask again.
Doesn't the pin operation migrate the memory to host memory?


Sorry missed your previous reply.

And yes at least for the amdgpu driver we migrate the memory to host memory as soon as it is pinned 
and I would expect that other GPU drivers do something similar.


Well...for many topologies, migrating to host memory will result in a
dramatically slower p2p setup. For that reason, some GPU drivers may
want to allow pinning of video memory in some situations.

Ideally, you've got modern ODP devices and you don't even need to pin.
But if not, and you still hope to do high performance p2p between a GPU
and a non-ODP Infiniband device, then you would need to leave the pinned
memory in vidmem.

So I think we don't want to rule out that behavior, right? Or is the
thinking more like, "you're lucky that this old non-ODP setup works at
all, and we'll make it work by routing through host/cpu memory, but it
will be slow"?


thanks,
--
John Hubbard
NVIDIA



This is intentional since we don't want any P2P to video memory with pinned objects and want to 
avoid to run into a situation where one device is doing P2P to video memory while another device 
needs the DMA-buf in host memory.


You can still do P2P with pinned object, it's just up to the exporting driver 
if it is allowed or not.

The other option is what Daniel suggested that we have some kind of revoke. This is essentially what 
our KFD is doing as well when doing interop with 3D GFX, but from Jasons responses I have a bit of 
doubt that this will actually work on the hardware level for RDMA.


Regards,
Christian.




Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount

2021-08-15 Thread John Hubbard

On 8/15/21 8:37 AM, Christoph Hellwig wrote:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8ae31622deef..d48a1f0889d1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1218,7 +1218,7 @@ __maybe_unused struct page *try_grab_compound_head(struct page *page, int refs,
  static inline __must_check bool try_get_page(struct page *page)
  {
page = compound_head(page);
-   if (WARN_ON_ONCE(page_ref_count(page) <= 0))
+   if (WARN_ON_ONCE(page_ref_count(page) < (int)!is_zone_device_page(page)))


Please avoid the overly long line.  In fact I'd be tempted to just not
bother here and keep the old, more loose check.  Especially given that
John has a patch ready that removes try_get_page entirely.



Yes. Andrew has accepted it into mmotm.

Ralph's patch here was written well before my cleanup that removed
try_grab_page() [1]. But now that we're here, if you drop this hunk then
it will make merging easier, I think.


[1] https://lore.kernel.org/r/20210813044133.1536842-4-jhubb...@nvidia.com
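
For readers decoding the one-liner above, my reading of the new check:

	/*
	 * !is_zone_device_page(page) is 1 for ordinary pages and 0 for
	 * ZONE_DEVICE pages, so the WARN fires when:
	 *   ordinary pages:    refcount <= 0 (unchanged behavior)
	 *   ZONE_DEVICE pages: refcount <  0 (refcount 0 becomes legal once
	 *                      the extra ZONE_DEVICE refcount is removed)
	 */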

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v9 07/10] mm: Device exclusive memory access

2021-06-03 Thread John Hubbard

On 6/2/21 1:50 AM, Balbir Singh wrote:
...

only impact the address space of programs using the GPU. Should the exclusively
marked range live in the unreclaimable list and recycled back to 
active/in-active
to account for the fact that

1. It is not reclaimable and reclaim will only hurt via page faults?
2. It ages the page correctly or at-least allows for that possibility when the
 page is used by the GPU.


I'm not sure that that is *necessarily* something we can conclude. It depends 
upon
access patterns of each program. For example, a "reduction" parallel program 
sends
over lots of data to the GPU, and only a tiny bit of (reduced!) data comes back
to the CPU. In that case, freeing the physical page on the CPU is actually the
best decision for the OS to make (if the OS is sufficiently prescient).



With a shared device or a device exclusive range, it would be good to get the 
device
usage pattern and update the mm with that knowledge, so that the LRU can be 
better


Integrating a GPU (or "device") processor and its mm behavior with the Linux
kernel is always an interesting concept. Certainly worth exploring, although
it's probably not a small project by any means.


maintained. With your comment you seem to suggest that a page used by the GPU 
might
be a good candidate for reclaim based on the CPU's understanding of the age of
the page should not account for use by the device
(are GPU workloads - access once and discard?)



Well, that's a little too narrow of an interpretation. The GPU is a fairly 
general
purpose processor, and so it has all kinds of workloads. I'm trying to 
discourage
any hopes that one can know, in advance, precisely how the GPU's pages need to 
be
managed. It's similar to the the CPU, in that regard. My example was just one, 
out
of a vast pool of possible behaviors.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v9 07/10] mm: Device exclusive memory access

2021-05-26 Thread John Hubbard

On 5/25/21 4:51 AM, Balbir Singh wrote:
...

How beneficial is this code to nouveau users?  I see that it permits a
part of OpenCL to be implemented, but how useful/important is this in
the real world?


That is a very good question! I've not reviewed the code, but a sample
program with the described use case would make things easy to parse.
I suspect that is not easy to build at the moment?



The cover letter says this:

This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
which checks that GPU atomic accesses to system memory are atomic. Without
this series the test fails as there is no way of write-protecting the page
mapping which results in the device clobbering CPU writes. For reference
the test is available at https://ozlabs.org/~apopple/opencl_svm_atomics/

Further testing has been performed by adding support for testing exclusive
access to the hmm-tests kselftests.

...so that seems to cover the "sample program" request, at least.


I wonder how we co-ordinate all the work the mm is doing, page migration,
reclaim with device exclusive access? Do we have any numbers for the worst
case page fault latency when something is marked away for exclusive access?


CPU page fault latency is approximately "terrible", if a page is resident on
the GPU. We have to spin up a DMA engine on the GPU and have it copy the page
over the PCIe bus, after all.


I presume for now this is anonymous memory only? SWP_DEVICE_EXCLUSIVE would


Yes, for now.


only impact the address space of programs using the GPU. Should the exclusively
marked range live in the unreclaimable list and recycled back to 
active/in-active
to account for the fact that

1. It is not reclaimable and reclaim will only hurt via page faults?
2. It ages the page correctly or at-least allows for that possibility when the
page is used by the GPU.


I'm not sure that that is *necessarily* something we can conclude. It depends 
upon
access patterns of each program. For example, a "reduction" parallel program 
sends
over lots of data to the GPU, and only a tiny bit of (reduced!) data comes back
to the CPU. In that case, freeing the physical page on the CPU is actually the
best decision for the OS to make (if the OS is sufficiently prescient).

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v9 07/10] mm: Device exclusive memory access

2021-05-24 Thread John Hubbard

On 5/24/21 3:11 PM, Andrew Morton wrote:

...

  Documentation/vm/hmm.rst |  17 
  include/linux/mmu_notifier.h |   6 ++
  include/linux/rmap.h |   4 +
  include/linux/swap.h |   7 +-
  include/linux/swapops.h  |  44 -
  mm/hmm.c |   5 +
  mm/memory.c  | 128 +++-
  mm/mprotect.c|   8 ++
  mm/page_vma_mapped.c |   9 +-
  mm/rmap.c| 186 +++
  10 files changed, 405 insertions(+), 9 deletions(-)



This is quite a lot of code added to core MM for a single driver.

Is there any expectation that other drivers will use this code?


Yes! This should work for GPUs (and potentially, other devices) that support
OpenCL SVM atomic accesses on the device. I haven't looked into how amdgpu
works in any detail, but that's certainly at the top of the list of likely
additional callers.



Is there a way of reducing the impact (code size, at least) for systems
which don't need this code?


I'll leave this question to others for the moment, in order to answer
the "do we need it at all" points.



How beneficial is this code to nouveau users?  I see that it permits a
part of OpenCL to be implemented, but how useful/important is this in
the real world?



So this is interesting. Right now, OpenCL support in Nouveau is rather new
and so probably not a huge impact yet. However, we've built up enough experience
with CUDA and OpenCL to learn that atomic operations, as part of the user
space programming model, are a super big deal. Atomic operations are so
useful and important that I'd expect many OpenCL SVM users to be uninterested in
programming models that lack atomic operations for GPU compute programs.

Again, this doesn't rule out future, non-GPU accelerator devices that may
come along.

Atomic ops are just a really important piece of high-end multi-threaded
programming, it turns out. So this is the beginning of support for an
important building block for general purpose programming on devices that
have GPU-like memory models.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

2021-03-30 Thread John Hubbard

On 3/30/21 8:56 PM, John Hubbard wrote:

On 3/30/21 3:56 PM, Alistair Popple wrote:
...

+1 for renaming "munlock*" items to "mlock*", where applicable. good grief.


At least the situation was weird enough to prompt further investigation :)

Renaming to mlock* doesn't feel like the right solution to me either though. I
am not sure if you saw me responding to myself earlier but I am thinking
renaming try_to_munlock() -> page_mlocked() and try_to_munlock_one() ->
page_mlock_one() might be better. Thoughts?



Quite confused by this naming idea. Because: try_to_munlock() returns
void, so a boolean-style name such as "page_mlocked()" is already not a
good fit.

Even more important, though, is that try_to_munlock() is mlock-ing the
page, right? Is there some subtle point I'm missing? It really is doing
an mlock to the best of my knowledge here. Although the kerneldoc
comment for try_to_munlock() seems questionable too:

/**
* try_to_munlock - try to munlock a page
* @page: the page to be munlocked
*
* Called from munlock code.  Checks all of the VMAs mapping the page
* to make sure nobody else has this page mlocked. The page will be
* returned with PG_mlocked cleared if no other vmas have it mlocked.
*/

...because I don't see where, in *this* routine, it clears PG_mlocked!

Obviously we agree that a routine should be named based on what it does,
rather than on who calls it. So I think that still leads to:

     try_to_munlock() --> try_to_mlock()
     try_to_munlock_one() --> try_to_mlock_one()

Sorry if I'm missing something really obvious.


Actually, re-reading your and Jason's earlier points in the thread, I see
that I'm *not* missing anything, and we are actually in agreement about how
the code operates. OK, good!

Also, as you point out above, maybe the "try_" prefix is not really accurate
either, given how this works. So maybe we have arrived at something like:

try_to_munlock() --> page_mlock() // or mlock_page()...
try_to_munlock_one() --> page_mlock_one()



thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

2021-03-30 Thread John Hubbard

On 3/30/21 3:56 PM, Alistair Popple wrote:
...

+1 for renaming "munlock*" items to "mlock*", where applicable. good grief.


At least the situation was weird enough to prompt further investigation :)

Renaming to mlock* doesn't feel like the right solution to me either though. I
am not sure if you saw me responding to myself earlier but I am thinking
renaming try_to_munlock() -> page_mlocked() and try_to_munlock_one() ->
page_mlock_one() might be better. Thoughts?



Quite confused by this naming idea. Because: try_to_munlock() returns
void, so a boolean-style name such as "page_mlocked()" is already not a
good fit.

Even more important, though, is that try_to_munlock() is mlock-ing the
page, right? Is there some subtle point I'm missing? It really is doing
an mlock to the best of my knowledge here. Although the kerneldoc
comment for try_to_munlock() seems questionable too:

/**
 * try_to_munlock - try to munlock a page
 * @page: the page to be munlocked
 *
 * Called from munlock code.  Checks all of the VMAs mapping the page
 * to make sure nobody else has this page mlocked. The page will be
 * returned with PG_mlocked cleared if no other vmas have it mlocked.
 */

...because I don't see where, in *this* routine, it clears PG_mlocked!

Obviously we agree that a routine should be named based on what it does,
rather than on who calls it. So I think that still leads to:

 try_to_munlock() --> try_to_mlock()
 try_to_munlock_one() --> try_to_mlock_one()

Sorry if I'm missing something really obvious.



This is actually inspired from a suggestion in
Documentation/vm/unevictable-lru.rst which warns about this problem:

try_to_munlock() Reverse Map Scan
-

.. warning::
[!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
page_referenced() reverse map walker.



This is actually rather bad advice! page_referenced() returns an
int-that-is-really-a-boolean, whereas try_to_munlock(), at least as it
stands now, returns void. Usually when I'm writing a TODO item, I'm in a
hurry, and I think that's what probably happened here, too. :)



Although, it seems reasonable to tack such renaming patches onto the tail end
of this series. But whatever works.


Unless anyone objects strongly I will roll the rename into this patch as there
is only one caller of try_to_munlock.

  - Alistair



No objections here. :)

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

2021-03-30 Thread John Hubbard

On 3/30/21 3:24 PM, Jason Gunthorpe wrote:
...

As far as I can tell this has always been called try_to_munlock() even though
it appears to do the opposite.


Maybe we should change it then?


/**
  * try_to_munlock - try to munlock a page
  * @page: the page to be munlocked
  *
  * Called from munlock code.  Checks all of the VMAs mapping the page
  * to make sure nobody else has this page mlocked. The page will be
  * returned with PG_mlocked cleared if no other vmas have it mlocked.
  */


In other words it sets PG_mlocked if one or more vmas has it mlocked. So
try_to_mlock() might be a better name, except that seems to have the potential
for confusion as well because it's only called from the munlock code path and
never for mlock.


That explanation makes more sense.. This function looks like it is
'set PG_mlocked of the page if any vm->flags has VM_LOCKED'

Maybe call it check_vm_locked or something then and reword the above
comment?

(and why is it OK to read vm->flags for this without any locking?)


Something needs attention here..


I think the code is correct, but perhaps the naming could be better. Would be
interested hearing any thoughts on renaming try_to_munlock() to try_to_mlock()
as the current name appears based on the context it is called from (munlock)
rather than what it does (mlock).


The point of this patch is to make it clearer, after all, so I'd
change something and maybe slightly clarify the comment.



I'd add that, after looking around the calling code, this is a really unhappy
pre-existing situation. Anyone reading this has to remember at which point in 
the
call stack the naming transitions from "do the opposite of what the name says",
to "do what the name says".

+1 for renaming "munlock*" items to "mlock*", where applicable. good grief.

Although, it seems reasonable to tack such renaming patches onto the tail end
of this series. But whatever works.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 2/2] drm/etnaviv: User FOLL_LONGTERM in userptr

2021-03-01 Thread John Hubbard

On 3/1/21 01:52, Daniel Vetter wrote:

There's no mmu notifier or anything like that, releasing this pin is
entirely up to userspace. Hence FOLL_LONGTERM.

No cc: stable for this patch since a lot of the infrastructure around
FOLL_LONGETRM (like not allowing it for pages currently sitting in


  ^FOLL_LONGTERM


ZONE_MOVEABLE before they're migrated) is still being worked on. So
not big benefits yet.


Yes. Great write-up, that's very clear, and it's exactly where we're at.

Reviewed-by: John Hubbard 


thanks,
--
John Hubbard
NVIDIA



Cc: John Hubbard 
Signed-off-by: Daniel Vetter 
Cc: Lucas Stach 
Cc: Russell King 
Cc: Christian Gmeiner 
Cc: etna...@lists.freedesktop.org
---
  drivers/gpu/drm/etnaviv/etnaviv_gem.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem.c b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
index a9e696d05b33..db69f19ab5bc 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_gem.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
@@ -689,7 +689,8 @@ static int etnaviv_gem_userptr_get_pages(struct etnaviv_gem_object *etnaviv_obj)
struct page **pages = pvec + pinned;
  
  		ret = pin_user_pages_fast(ptr, num_pages,
- FOLL_WRITE | FOLL_FORCE, pages);
+ FOLL_WRITE | FOLL_FORCE | FOLL_LONGTERM,
+ pages);
if (ret < 0) {
unpin_user_pages(pvec, pinned);
kvfree(pvec);
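
For completeness, the matching release side of such a FOLL_LONGTERM
userptr pin is simply unpin_user_pages() once userspace is done with
the object (sketch; the helper below is illustrative, not etnaviv
code):

	static void userptr_release(struct page **pages, unsigned long npages)
	{
		/* FOLL_PIN pages must be released with unpin_user_pages() */
		unpin_user_pages(pages, npages);
		kvfree(pages);
	}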




Re: [PATCH 0/9] Add support for SVM atomics in Nouveau

2021-02-10 Thread John Hubbard

On 2/10/21 4:59 AM, Daniel Vetter wrote:
...

GPU atomic operations to sysmem are hard to categorize, because application
programmers could easily write programs that do a long series of atomic
operations. Such a program would be a little weird, but it's hard to rule out.


Yeah, but we can forcefully break this whenever we feel like by revoking
the page, moving it, and then reinstating the gpu pte again and let it
continue.


Oh yes, that's true.



If that's no possible then what we need here instead is an mlock() type of
thing I think.

No need for that, then.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 0/9] Add support for SVM atomics in Nouveau

2021-02-09 Thread John Hubbard

On 2/9/21 5:37 AM, Daniel Vetter wrote:

On Tue, Feb 9, 2021 at 1:57 PM Alistair Popple  wrote:


On Tuesday, 9 February 2021 9:27:05 PM AEDT Daniel Vetter wrote:


Recent changes to pin_user_pages() prevent the creation of pinned pages in
ZONE_MOVABLE. This series allows pinned pages to be created in ZONE_MOVABLE
as attempts to migrate may fail which would be fatal to userspace.

In this case migration of the pinned page is unnecessary as the page can be
unpinned at anytime by having the driver revoke atomic permission as it
does for the migrate_to_ram() callback. However a method of calling this
when memory needs to be moved has yet to be resolved so any discussion is
welcome.


Why do we need to pin for gpu atomics? You still have the callback for
cpu faults, so you can move the page as needed, and hence a long-term pin
sounds like the wrong approach.


Technically a real long term unmoveable pin isn't required, because as you say
the page can be moved as needed at any time. However I needed some way of
stopping the CPU page from being freed once the userspace mappings for it had
been removed. Obviously I could have just used get_page() but from the
perspective of page migration the result is much the same as a pin - a page
which can't be moved because of the extra refcount.


long term pin vs short term page reference aren't fully fleshed out.
But the rule more or less is:
- short term page reference: _must_ get released in finite time for
migration and other things, either because you have a callback, or
because it's just for direct I/O, which will complete. This means
short term pins will delay migration, but not foul it completely



GPU atomic operations to sysmem are hard to categorize, because application
programmers could easily write programs that do a long series of atomic
operations. Such a program would be a little weird, but it's hard to rule out.




- long term pin: the page cannot be moved, all migration must fail.
Also this will have an impact on COW behaviour for fork (but not sure
where those patches are, John Hubbard will know).



That would be Jason's commit 57efa1fe59576 ("mm/gup: prevent gup_fast from 
racing
with COW during fork"), which is in linux-next 20201216.




So I think for your use case here you want a) short term page
reference to make sure it doesn't disappear plus b) callback to make
sure migrate isn't blocked.

Breaking ZONE_MOVEABLE with either allowing long term pins or failing
migrations because you don't release your short term page reference
isn't good.


The normal solution of registering an MMU notifier to unpin the page when it
needs to be moved also doesn't work as the CPU page tables now point to the
device-private page and hence the migration code won't call any invalidate
notifiers for the CPU page.


Yeah you need some other callback for migration on the page directly.
it's a bit awkward since there is one already for struct
address_space, but that's own by the address_space/page cache, not
HMM. So I think we need something else, maybe something for each
ZONE_DEVICE?



This direction sounds at least...possible. Using MMU notifiers instead of pins
is definitely appealing. I'm not quite clear on the callback idea above, but
overall it seems like taking advantage of the ZONE_DEVICE tracking of pages
(without having to put anything additional in each struct page), could work.

Additional notes or ideas here are definitely welcome.



thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v16 0/4] RDMA: Add dma-buf support

2021-02-05 Thread John Hubbard

On 2/5/21 7:53 AM, Daniel Vetter wrote:

On Fri, Feb 05, 2021 at 11:43:19AM -0400, Jason Gunthorpe wrote:

On Fri, Feb 05, 2021 at 04:39:47PM +0100, Daniel Vetter wrote:


And again, for slightly older hardware, without pinning to VRAM there is
no way to use this solution here for peer-to-peer. So I'm glad to see that
so far you're not ruling out the pinning option.


Since HMM and ZONE_DEVICE came up, I'm kinda tempted to make ZONE_DEVICE
ZONE_MOVEABLE (at least if you don't have a pinned vram contingent in your
cgroups) or something like that, so we could benefit from the work to make
sure pin_user_pages and all these never end up in there?


ZONE_DEVICE should already not be returned from GUP.

I've understood in the hmm casse the idea was a CPU touch of some
ZONE_DEVICE pages would trigger a migration to CPU memory, GUP would
want to follow the same logic, presumably it comes for free with the
fault handler somehow


Oh I didn't know this, I thought the proposed p2p direct i/o patches would
just use the fact that underneath ZONE_DEVICE there's "normal" struct
pages. And so I got worried that maybe also pin_user_pages can creep in.
But I didn't read the patches in full detail:

https://lore.kernel.org/linux-block/20201106170036.18713-12-log...@deltatee.com/

But if you're saying that this all needs specific code and all the gup/pup
code we have is excluded, I think we can make sure that we're not ever
building features that requiring time-unlimited pinning of ZONE_DEVICE.
Which I think we want.



From an HMM perspective, the above sounds about right. HMM relies on the
GPU/device memory being ZONE_DEVICE, *and* on that memory *not* being pinned.
(HMM's mmu notifier callbacks act as a sort of virtual pin, but not a refcount
pin.)

It's a nice clean design point that we need to preserve, and fortunately it
doesn't conflict with anything I'm seeing here. But I want to say this out
loud because I see some doubt about it creeping into the discussion.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH rdma-core 2/3] pyverbs,tests: Cosmetic improvements for dma-buf allocation routines

2021-02-04 Thread John Hubbard

On 2/4/21 10:50 AM, Jianxin Xiong wrote:

Rename the parameter 'unit' to 'gpu'. Expand GTT to the full name in the
comments.

Signed-off-by: Jianxin Xiong 
---
  pyverbs/dmabuf.pyx | 12 
  pyverbs/dmabuf_alloc.c | 12 
  pyverbs/dmabuf_alloc.h |  2 +-
  pyverbs/mr.pyx |  6 ++--
  tests/test_mr.py   | 78 +-
  5 files changed, 55 insertions(+), 55 deletions(-)



Looks good!

If you care, you might want to add a space, like this, to the few GTT cases:

GTT (Graphics Translation Table)

Obviously not worth spinning another version for that, as it is still readable
as-is. Just mentioning it for the sake of pointless perfectionism, and in case
someone ever wonders why it was missed during a review. :) Either way, feel free
to add:

Reviewed-by: John Hubbard 


thanks,
--
John Hubbard
NVIDIA



diff --git a/pyverbs/dmabuf.pyx b/pyverbs/dmabuf.pyx
index b9406bd..9ed7f02 100644
--- a/pyverbs/dmabuf.pyx
+++ b/pyverbs/dmabuf.pyx
@@ -12,7 +12,7 @@ from pyverbs.mr cimport DmaBufMR
  cdef extern from "dmabuf_alloc.h":
  cdef struct dmabuf:
  pass
-dmabuf *dmabuf_alloc(unsigned long size, int unit, int gtt)
+dmabuf *dmabuf_alloc(unsigned long size, int gpu, int gtt)
  void dmabuf_free(dmabuf *dmabuf)
  int dmabuf_get_drm_fd(dmabuf *dmabuf)
  int dmabuf_get_fd(dmabuf *dmabuf)
@@ -20,20 +20,20 @@ cdef extern from "dmabuf_alloc.h":
  
  
  cdef class DmaBuf:

-def __init__(self, size, unit=0, gtt=0):
+def __init__(self, size, gpu=0, gtt=0):
  """
  Allocate DmaBuf object from a GPU device. This is done through the
  DRI device interface. Usually this requires the effective user id
  being a member of the 'render' group.
  :param size: The size (in number of bytes) of the buffer.
-:param unit: The unit number of the GPU to allocate the buffer from.
-:param gtt: Allocate from GTT instead of VRAM.
+:param gpu: The GPU unit to allocate the buffer from.
+:param gtt: Allocate from GTT(Graphics Translation Table) instead of VRAM.
  :return: The newly created DmaBuf object on success.
  """
  self.dmabuf_mrs = weakref.WeakSet()
-self.dmabuf = dmabuf_alloc(size, unit, gtt)
+self.dmabuf = dmabuf_alloc(size, gpu, gtt)
  if self.dmabuf == NULL:
-raise PyverbsRDMAErrno(f'Failed to allocate dmabuf of size {size} on unit {unit}')
+raise PyverbsRDMAErrno(f'Failed to allocate dmabuf of size {size} on gpu {gpu}')
  self.drm_fd = dmabuf_get_drm_fd(self.dmabuf)
  self.fd = dmabuf_get_fd(self.dmabuf)
  self.map_offset = dmabuf_get_offset(self.dmabuf)
diff --git a/pyverbs/dmabuf_alloc.c b/pyverbs/dmabuf_alloc.c
index 05eae75..93267bf 100644
--- a/pyverbs/dmabuf_alloc.c
+++ b/pyverbs/dmabuf_alloc.c
@@ -95,7 +95,7 @@ static int amdgpu_mmap_offset(struct drm *drm, uint32_t handle,
return 0;
  }
  
-static struct drm *drm_open(int unit)

+static struct drm *drm_open(int gpu)
  {
char path[32];
struct drm_version version = {};
@@ -107,7 +107,7 @@ static struct drm *drm_open(int unit)
if (!drm)
return NULL;
  
-	snprintf(path, sizeof(path), "/dev/dri/renderD%d", unit + 128);

+   snprintf(path, sizeof(path), "/dev/dri/renderD%d", gpu + 128);
  
  	drm->fd = open(path, O_RDWR);

if (drm->fd < 0)
@@ -204,10 +204,10 @@ struct dmabuf {
  /*
   * dmabuf_alloc - allocate a dmabuf from GPU
   * @size - byte size of the buffer to allocate
- * @unit - the GPU unit to use
- * @gtt - if true, allocate from GTT instead of VRAM
+ * @gpu - the GPU unit to use
+ * @gtt - if true, allocate from GTT(Graphics Translation Table) instead of VRAM
   */
-struct dmabuf *dmabuf_alloc(uint64_t size, int unit, int gtt)
+struct dmabuf *dmabuf_alloc(uint64_t size, int gpu, int gtt)
  {
struct dmabuf *dmabuf;
int err;
@@ -216,7 +216,7 @@ struct dmabuf *dmabuf_alloc(uint64_t size, int unit, int gtt)
if (!dmabuf)
return NULL;
  
-	dmabuf->drm = drm_open(unit);

+   dmabuf->drm = drm_open(gpu);
if (!dmabuf->drm)
goto out_free;
  
diff --git a/pyverbs/dmabuf_alloc.h b/pyverbs/dmabuf_alloc.h

index f1b03c5..4698b11 100644
--- a/pyverbs/dmabuf_alloc.h
+++ b/pyverbs/dmabuf_alloc.h
@@ -10,7 +10,7 @@
  
  struct dmabuf;
  
-struct dmabuf *dmabuf_alloc(uint64_t size, int unit, int gtt);

+struct dmabuf *dmabuf_alloc(uint64_t size, int gpu, int gtt);
  void dmabuf_free(struct dmabuf *dmabuf);
  int dmabuf_get_drm_fd(struct dmabuf *dmabuf);
  int dmabuf_get_fd(struct dmabuf *dmabuf);
diff --git a/pyverbs/mr.pyx b/pyverbs/mr.pyx
index aad47e2..d05d044 100644
--- a/pyverbs/mr.pyx
+++ b/pyverbs/mr.pyx
@@ -384,7 +384,7 @@ cdef class DMMR(MR):
  
  cdef class DmaBufMR(MR):

  def __ini

Re: [PATCH v16 0/4] RDMA: Add dma-buf support

2021-02-04 Thread John Hubbard

On 2/4/21 10:44 AM, Alex Deucher wrote:
...

The argument is that vram is a scarce resource, but I don't know if
that is really the case these days.  At this point, we often have as
much vram as system ram if not more.


I thought the main argument was that GPU memory could move at any time
between the GPU and CPU and the DMA buf would always track its current
location?


I think the reason for that is that VRAM is scarce so we have to be
able to move it around.  We don't enforce the same limitations for
buffers in system memory.  We could just support pinning dma-bufs in
vram like we do with system ram.  Maybe with some conditions, e.g.,
p2p is possible, and the device has a large BAR so you aren't tying up
the BAR window.



Excellent. And yes, we are already building systems in which VRAM is
definitely not scarce, but on the other hand, those newer systems can
also handle GPU (and NIC) page faults, so not really an issue. For that,
we just need to enhance HMM so that it does peer to peer.

We also have some older hardware with large BAR1 apertures, specifically
for this sort of thing.

And again, for slightly older hardware, without pinning to VRAM there is
no way to use this solution here for peer-to-peer. So I'm glad to see that
so far you're not ruling out the pinning option.



thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v16 0/4] RDMA: Add dma-buf support

2021-02-03 Thread John Hubbard

On 12/15/20 1:27 PM, Jianxin Xiong wrote:

This patch series adds a dma-buf importer role to the RDMA driver in an
attempt to support RDMA using device memory such as GPU VRAM. Dma-buf is
chosen for a few reasons: first, the API is relatively simple and allows
a lot of flexibility in implementing the buffer manipulation ops.
Second, it doesn't require page structures. Third, dma-buf is already
supported in many GPU drivers. However, we are aware that existing GPU
drivers don't allow pinning device memory via the dma-buf interface.
Pinning would simply cause the backing storage to migrate to system RAM.
True peer-to-peer access is only possible using dynamic attach, which
requires on-demand paging support from the NIC to work. For this reason,
this series only works with ODP capable NICs.


Hi,

Looking ahead to after this patchset is merged...

Are there design thoughts out there, about the future of pinning to vidmem,
for this? It would allow a huge group of older GPUs and NICs and such to
do p2p with this approach, and it seems like a natural next step, right?
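
For readers less familiar with the dynamic-attach path mentioned above, the
import flow looks roughly like this (a sketch of the generic dma-buf API,
not the patch itself; "importer_ops" is a made-up name for a struct
dma_buf_attach_ops that supplies a move_notify() callback):

	struct dma_buf *buf = dma_buf_get(fd);	/* fd exported by the GPU driver */
	struct dma_buf_attachment *att;
	struct sg_table *sgt;

	/* dynamic attach: the exporter remains free to move the buffer,
	 * and calls importer_ops->move_notify() instead of requiring a pin */
	att = dma_buf_dynamic_attach(buf, dev, &importer_ops, importer_priv);

	dma_resv_lock(buf->resv, NULL);
	sgt = dma_buf_map_attachment(att, DMA_BIDIRECTIONAL);
	dma_resv_unlock(buf->resv);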


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH rdma-core v7 5/6] tests: Add tests for dma-buf based memory regions

2021-01-31 Thread John Hubbard

On 1/25/21 11:57 AM, Jianxin Xiong wrote:

Define a set of unit tests similar to regular MR tests and a set of
tests for send/recv and rdma traffic using dma-buf MRs. Add a utility
function to generate access flags for dma-buf based MRs because the
set of supported flags is smaller.

Signed-off-by: Jianxin Xiong 


Hi Jianxin,

It's awesome to see a GPU to IB test suite here!


---
  tests/args_parser.py |   4 +
  tests/test_mr.py | 266 ++-
  tests/utils.py   |  26 +
  3 files changed, 295 insertions(+), 1 deletion(-)

diff --git a/tests/args_parser.py b/tests/args_parser.py
index 446535a..5bc53b0 100644
--- a/tests/args_parser.py
+++ b/tests/args_parser.py
@@ -19,6 +19,10 @@ class ArgsParser(object):
  parser.add_argument('--port',
  help='Use port  of RDMA device', type=int,
  default=1)
+parser.add_argument('--gpu', nargs='?', type=int, const=0, default=0,
+help='GPU unit to allocate dmabuf from')
+parser.add_argument('--gtt', action='store_true', default=False,
+help='Allocate dmabuf from GTT instead of VRAM')


Just to be kind to non-GPU people, how about:

s/GTT/GTT (Graphics Translation Table)/



  parser.add_argument('-v', '--verbose', dest='verbosity',
  action='store_const',
  const=2, help='Verbose output')
diff --git a/tests/test_mr.py b/tests/test_mr.py
index b88ad23..03a645f 100644
--- a/tests/test_mr.py
+++ b/tests/test_mr.py
@@ -1,5 +1,6 @@
  # SPDX-License-Identifier: (GPL-2.0 OR Linux-OpenIB)
  # Copyright (c) 2019 Mellanox Technologies, Inc. All rights reserved. See COPYING file
+# Copyright (c) 2020 Intel Corporation. All rights reserved. See COPYING file
  """
  Test module for pyverbs' mr module.
  """
@@ -9,15 +10,18 @@ import errno
  
  from tests.base import PyverbsAPITestCase, RCResources, RDMATestCase

  from pyverbs.pyverbs_error import PyverbsRDMAError, PyverbsError
-from pyverbs.mr import MR, MW, DMMR, MWBindInfo, MWBind
+from pyverbs.mr import MR, MW, DMMR, DmaBufMR, MWBindInfo, MWBind
  from pyverbs.qp import QPCap, QPInitAttr, QPAttr, QP
  from pyverbs.mem_alloc import posix_memalign, free
  from pyverbs.wr import SendWR
+from pyverbs.dmabuf import DmaBuf
  import pyverbs.device as d
  from pyverbs.pd import PD
  import pyverbs.enums as e
  import tests.utils as u
  
+MAX_IO_LEN = 1048576

+
  
  class MRRes(RCResources):

  def __init__(self, dev_name, ib_port, gid_index,
@@ -423,3 +427,263 @@ class DMMRTest(PyverbsAPITestCase):
  dm_mr = DMMR(pd, dm_mr_len, e.IBV_ACCESS_ZERO_BASED,
   dm=dm, offset=dm_mr_offset)
  dm_mr.close()
+
+
+def check_dmabuf_support(unit=0):
+"""
+Check if dma-buf allocation is supported by the system.
+Skip the test on failure.
+"""
+device_num = 128 + unit
+try:
+DmaBuf(1, unit=unit)


unit?? This is a GPU, never anything else! Let's s/unit/gpu/ throughout, yes?

thanks,
--
John Hubbard
NVIDIA


+except PyverbsRDMAError as ex:
+if ex.error_code == errno.ENOENT:
+raise unittest.SkipTest(f'Device /dev/dri/renderD{device_num} is not present')
+if ex.error_code == errno.EACCES:
+raise unittest.SkipTest(f'Lack of permission to access /dev/dri/renderD{device_num}')
+if ex.error_code == errno.EOPNOTSUPP:
+raise unittest.SkipTest(f'Allocating dmabuf is not supported by /dev/dri/renderD{device_num}')
+
+
+def check_dmabuf_mr_support(pd, unit=0):
+"""
+Check if dma-buf MR registration is supported by the driver.
+Skip the test on failure
+"""
+try:
+DmaBufMR(pd, 1, 0, unit=unit)
+except PyverbsRDMAError as ex:
+if ex.error_code == errno.EOPNOTSUPP:
+raise unittest.SkipTest('Reg dma-buf MR is not supported by the RDMA driver')
+
+
+class DmaBufMRTest(PyverbsAPITestCase):
+"""
+Test various functionalities of the DmaBufMR class.
+"""
+def setUp(self):
+super().setUp()
+self.unit = self.config['gpu']
+self.gtt = self.config['gtt']
+
+def test_dmabuf_reg_mr(self):
+"""
+Test ibv_reg_dmabuf_mr()
+"""
+check_dmabuf_support(self.unit)
+for ctx, attr, attr_ex in self.devices:
+with PD(ctx) as pd:
+check_dmabuf_mr_support(pd, self.unit)
+flags = u.get_dmabuf_access_flags(ctx)
+for f in flags:
+len = u.get_mr_length()
+for off in [0, len//2]:
+with DmaBufMR(pd, len, f, offset=off, unit=self.unit,
+

Re: [PATCH rdma-core 5/5] tests: Bug fix for get_access_flags()

2020-11-24 Thread John Hubbard

Just some silly nits I stumbled across while trying to understand the tests.

On 11/23/20 9:53 AM, Jianxin Xiong wrote:

The filter defintion is wrong and causes get_access_flags() always


 definition


returning empty list. As the result the MR tests using this function
are effectively skipped (but report success).

Also fix a typo in the comments.


Was there another typo somewhere? All I see is an *added* typo...



Signed-off-by: Jianxin Xiong 
---
  tests/utils.py | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tests/utils.py b/tests/utils.py
index 0ad7110..eee44b4 100644
--- a/tests/utils.py
+++ b/tests/utils.py
@@ -55,8 +55,8 @@ def filter_illegal_access_flags(element):
  :param element: A list of access flags to check
  :return: True if this list is legal, else False
  """
-if e.IBV_ACCESS_REMOTE_ATOMIC in element or e.IBV_ACCESS_REMOTE_WRITE:
-if e.IBV_ACCESS_LOCAL_WRITE:
+if e.IBV_ACCESS_REMOTE_ATOMIC in element or e.IBV_ACCESS_REMOTE_WRITE in element:
+if not e.IBV_ACCESS_LOCAL_WRITE in element:
  return False
  return True
  
@@ -69,7 +69,7 @@ def get_access_flags(ctx):

  added as well.
  After verifying that the flags selection is legal, it is appended to an
  array, assuming it wasn't previously appended.
-:param ctx: Device Context to check capabilities
+:param ctx: Device Coyyntext to check capabilities


I liked the old spelling. "Coyyntext" just doesn't sound as good. :)


  :param num: Size of initial collection
  :return: A random legal value for MR flags
      """



thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM

2020-11-05 Thread John Hubbard

On 11/5/20 4:49 AM, Jason Gunthorpe wrote:

On Thu, Nov 05, 2020 at 10:25:24AM +0100, Daniel Vetter wrote:

/*
  * If we can't determine whether or not a pte is special, then fail immediately
  * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
  * to be special.
  *
  * For a futex to be placed on a THP tail page, get_futex_key requires a
  * get_user_pages_fast_only implementation that can pin pages. Thus it's still
  * useful to have gup_huge_pmd even if we can't operate on ptes.
  */


We support hugepage faults in gpu drivers since recently, and I'm not
seeing a pud_mkhugespecial anywhere. So not sure this works, but probably
just me missing something again.


It means ioremap can't create an IO page PUD, it has to be broken up.

Does ioremap even create anything larger than PTEs?



From my reading, yes. See ioremap_try_huge_pmd().
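
The gist of it (paraphrasing lib/ioremap.c from memory, so double-check
against the tree): if the range is PMD-sized and suitably aligned, a huge
IO mapping is installed at the pmd level:

	/* roughly what ioremap_try_huge_pmd() does: */
	if (ioremap_pmd_enabled() &&
	    (end - addr) == PMD_SIZE &&
	    IS_ALIGNED(addr, PMD_SIZE) &&
	    IS_ALIGNED(phys_addr, PMD_SIZE))
		return pmd_set_huge(pmd, phys_addr, prot);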

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM

2020-11-04 Thread John Hubbard

On 11/4/20 10:17 AM, Jason Gunthorpe wrote:

On Wed, Nov 04, 2020 at 04:41:19PM +, Christoph Hellwig wrote:

On Wed, Nov 04, 2020 at 04:37:58PM +, Christoph Hellwig wrote:

On Wed, Nov 04, 2020 at 05:26:58PM +0100, Daniel Vetter wrote:

What we're discussing is whether gup_fast and pup_fast also obey this,
or fall over and can give you the struct page that's backing the
dma_mmap_* memory. Since the _fast variant doesn't check for
vma->vm_flags, and afaict that's the only thing which closes this gap.
And like you restate, that would be a bit a problem. So where's that
check which Jason aren't spotting?


remap_pte_range uses pte_mkspecial to set up the PTEs, and gup_pte_range
errors out on pte_special.  Of course this only works for the
CONFIG_ARCH_HAS_PTE_SPECIAL case, for other architectures we do have
a real problem.


Except that we don't really support pte-level gup-fast without
CONFIG_ARCH_HAS_PTE_SPECIAL, and in fact all architectures selecting
HAVE_FAST_GUP also select ARCH_HAS_PTE_SPECIAL, so we should be fine.


Mm, I thought it was probably the special flag..

Knowing that CONFIG_HAVE_FAST_GUP can't be set without
CONFIG_ARCH_HAS_PTE_SPECIAL is pretty insightful, can we put that in
the Kconfig?

config HAVE_FAST_GUP
 depends on MMU
 depends on ARCH_HAS_PTE_SPECIAL
 bool


Well, the !CONFIG_ARCH_HAS_PTE_SPECIAL case points out in a comment that
gup-fast is not *completely* unavailable there, so I don't think you want
to shut it off like that:

/*
 * If we can't determine whether or not a pte is special, then fail immediately
 * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
 * to be special.
 *
 * For a futex to be placed on a THP tail page, get_futex_key requires a
 * get_user_pages_fast_only implementation that can pin pages. Thus it's still
 * useful to have gup_huge_pmd even if we can't operate on ptes.
 */
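
To make that dependency concrete, the pte-level fast path keys entirely off
pte_special(); paraphrasing mm/gup.c (from memory, not verbatim):

	#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
		/* inside gup_pte_range(): special ptes, e.g. ones installed
		 * by remap_pfn_range(), are rejected and fall back to the
		 * slow path with its vma checks */
		if (pte_special(pte))
			goto pte_unmap;
	#else
		/* without pte_special() the pte walker is a stub that fails,
		 * leaving only gup_huge_pmd()/gup_huge_pud() usable */
		static int gup_pte_range(pmd_t pmd, unsigned long addr,
					 unsigned long end, unsigned int flags,
					 struct page **pages, int *nr)
		{
			return 0;
		}
	#endif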


thanks,
--
John Hubbard
NVIDIA


Re: [resource] 22b17dc667: Kernel panic - not syncing: Fatal exception

2020-11-02 Thread John Hubbard
  0010 DS:  ES:  CR0: 80050033
[   29.017409] CR2:  CR3: 000100120003 CR4: 007706f0
[   29.024539] DR0:  DR1:  DR2: 
[   29.031671] DR3:  DR6: fffe0ff0 DR7: 0400
[   29.038804] PKRU: 5554
[   29.041508] Kernel panic - not syncing: Fatal exception
ACPI MEMORY or I/O RESET_REG.


To reproduce:

 git clone https://github.com/intel/lkp-tests.git
 cd lkp-tests
 bin/lkp install job.yaml  # job file is attached in this email
 bin/lkp run job.yaml



Thanks,
oliver.s...@intel.com



thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM

2020-11-01 Thread John Hubbard

On 11/1/20 2:30 AM, Daniel Vetter wrote:

On Sun, Nov 1, 2020 at 6:22 AM John Hubbard  wrote:


On 10/31/20 7:45 AM, Daniel Vetter wrote:

On Sat, Oct 31, 2020 at 3:55 AM John Hubbard  wrote:

On 10/30/20 3:08 AM, Daniel Vetter wrote:

...

By removing this check from this location, and changing from
pin_user_pages_locked() to pin_user_pages_fast(), I *think* we end up
losing the check entirely. Is that intended? If so it could use a comment
somewhere to explain why.


Yeah this wasn't intentional. I think I needed to drop the _locked
version to prep for FOLL_LONGTERM, and figured _fast is always better.
But I didn't realize that _fast doesn't have the vma checks, gup.c got
me a bit confused.


Actually, I thought that the change to _fast was a very nice touch, btw.



I'll remedy this in all the patches where this applies (because a
VM_IO | VM_PFNMAP mapping can point at struct page backed memory, and that
exact use-case is what we want to stop with the unsafe_follow_pfn work,
since it wrecks things like CMA or security).

Aside: I do wonder whether the lack for that check isn't a problem.
VM_IO | VM_PFNMAP generally means driver managed, which means the
driver isn't going to consult the page pin count or anything like that
(at least not necessarily) when revoking or moving that memory, since
we're assuming it's totally under driver control. So if pup_fast can
get into such a mapping, we might have a problem.
-Daniel



Yes. I don't know why that check is missing from the _fast path.
Probably just an oversight, seeing as how it's in the slow path. Maybe
the appropriate response here is to add a separate patch that adds the
check.

I wonder if I'm overlooking something, but it certainly seems correct to
do that.


You'll need the mmap_sem to get at the vma to be able to do this
check. If you add that to _fast, you made it as fast as the slow one.


Arggh, yes of course. Strike that, please. :)


Plus there's _fast_only due to locking recursion issues in fast-paths
(I assume, I didn't check all the callers).

I'm just wondering whether we have a bug somewhere with device
drivers. For CMA regions we always check in try_grab_page, but for dax


OK, so here you're talking about a different bug than the VM_IO | VM_PFNMAP
pages, I think. This is about the "FOLL_LONGTERM + CMA + gup/pup _fast"
combination that is not allowed, right?

For that: try_grab_page() doesn't check anything, but try_grab_compound_head()
does, but only for pup_fast, not gup_fast. That was added by commit
df3a0a21b698d ("mm/gup: fix omission of check on FOLL_LONGTERM in gup fast
path") in April.
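
For reference, the check that commit added looks like this (quoting from
memory; it sits in try_grab_compound_head(), under the FOLL_PIN branch only):

	if (unlikely(flags & FOLL_LONGTERM) && is_migrate_cma_page(page))
		return NULL;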

I recall that the patch was just plugging a very specific hole, as opposed
to locking down the API against mistakes or confused callers. And it does
seem that there are some holes.


I'm not seeing where the checks in the _fast fastpaths are, and that
all still leaves random device driver mappings behind which aren't
backed by CMA but still point to something with a struct page behind
it. I'm probably just missing something, but no idea what.
-Daniel



Certainly we've established that we can't check VMA flags by that time,
so I'm not sure that there is much we can check by the time we get to
gup/pup _fast. Seems like the device drivers have to avoid calling _fast
with pages that live in VM_IO | VM_PFNMAP, by design, right? Or maybe
you're talking about CMA checks only?


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM

2020-10-31 Thread John Hubbard

On 10/31/20 7:45 AM, Daniel Vetter wrote:

On Sat, Oct 31, 2020 at 3:55 AM John Hubbard  wrote:

On 10/30/20 3:08 AM, Daniel Vetter wrote:

...

By removing this check from this location, and changing from
pin_user_pages_locked() to pin_user_pages_fast(), I *think* we end up
losing the check entirely. Is that intended? If so it could use a comment
somewhere to explain why.


Yeah this wasn't intentional. I think I needed to drop the _locked
version to prep for FOLL_LONGTERM, and figured _fast is always better.
But I didn't realize that _fast doesn't have the vma checks, gup.c got
me a bit confused.


Actually, I thought that the change to _fast was a very nice touch, btw.



I'll remedy this in all the patches where this applies (because a
VM_IO | VM_PFNMAP mapping can point at struct page backed memory, and that
exact use-case is what we want to stop with the unsafe_follow_pfn work,
since it wrecks things like CMA or security).

Aside: I do wonder whether the lack for that check isn't a problem.
VM_IO | VM_PFNMAP generally means driver managed, which means the
driver isn't going to consult the page pin count or anything like that
(at least not necessarily) when revoking or moving that memory, since
we're assuming it's totally under driver control. So if pup_fast can
get into such a mapping, we might have a problem.
-Daniel



Yes. I don't know why that check is missing from the _fast path.
Probably just an oversight, seeing as how it's in the slow path. Maybe
the appropriate response here is to add a separate patch that adds the
check.

I wonder if I'm overlooking something, but it certainly seems correct to
do that.

 thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v5 13/15] resource: Move devmem revoke code to resource framework

2020-10-31 Thread John Hubbard

On 10/30/20 3:08 AM, Daniel Vetter wrote:

We want all iomem mmaps to consistently revoke ptes when the kernel
takes over and CONFIG_IO_STRICT_DEVMEM is enabled. This includes the
pci bar mmaps available through procfs and sysfs, which currently do
not revoke mappings.

To prepare for this, move the code from the /dev/kmem driver to
kernel/resource.c.


This seems like it's doing a lot more than just code movement, right?
Should we list some of that here?

Also, I'm seeing a crash due to this commit. More below:



Reviewed-by: Greg Kroah-Hartman 
Signed-off-by: Daniel Vetter 
Cc: Jason Gunthorpe 
Cc: Kees Cook 
Cc: Dan Williams 
Cc: Andrew Morton 
Cc: John Hubbard 
Cc: Jérôme Glisse 
Cc: Jan Kara 
Cc: Dan Williams 
Cc: linux...@kvack.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-samsung-...@vger.kernel.org
Cc: linux-me...@vger.kernel.org
Cc: Arnd Bergmann 
Cc: Greg Kroah-Hartman 
Cc: Daniel Vetter 
Cc: David Hildenbrand 
Cc: "Rafael J. Wysocki" 
Signed-off-by: Daniel Vetter 
--
v3:
- add barrier for consistency and document why we don't have to check
   for NULL (Jason)
v4:
- Adjust comments to reflect the general nature of this iomem revoke
   code now (Dan)
---
  drivers/char/mem.c |  85 +-
  include/linux/ioport.h |   6 +--
  kernel/resource.c  | 101 -
  3 files changed, 102 insertions(+), 90 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 7dcf9e4ea79d..43c871dc7477 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -31,9 +31,6 @@
  #include 
  #include 
  #include 
-#include 
-#include 
-#include 
  
  #ifdef CONFIG_IA64

  # include 
@@ -836,42 +833,6 @@ static loff_t memory_lseek(struct file *file, loff_t offset, int orig)
return ret;
  }
  
-static struct inode *devmem_inode;

-
-#ifdef CONFIG_IO_STRICT_DEVMEM
-void revoke_devmem(struct resource *res)
-{
-   /* pairs with smp_store_release() in devmem_init_inode() */
-   struct inode *inode = smp_load_acquire(&devmem_inode);
-
-   /*
-* Check that the initialization has completed. Losing the race
-* is ok because it means drivers are claiming resources before
-* the fs_initcall level of init and prevent /dev/mem from
-* establishing mappings.
-*/
-   if (!inode)
-   return;
-
-   /*
-* The expectation is that the driver has successfully marked
-* the resource busy by this point, so devmem_is_allowed()
-* should start returning false, however for performance this
-* does not iterate the entire resource range.
-*/
-   if (devmem_is_allowed(PHYS_PFN(res->start)) &&
-   devmem_is_allowed(PHYS_PFN(res->end))) {
-   /*
-* *cringe* iomem=relaxed says "go ahead, what's the
-* worst that can happen?"
-*/
-   return;
-   }
-
-   unmap_mapping_range(inode->i_mapping, res->start, resource_size(res), 1);
-}
-#endif
-
  static int open_port(struct inode *inode, struct file *filp)
  {
int rc;
@@ -891,7 +852,7 @@ static int open_port(struct inode *inode, struct file *filp)
 * revocations when drivers want to take over a /dev/mem mapped
 * range.
 */
-   filp->f_mapping = inode->i_mapping;
+   filp->f_mapping = iomem_get_mapping();



The problem is that iomem_get_mapping() returns NULL for the 
!CONFIG_IO_STRICT_DEVMEM
case. And then we have pre-existing fs code that expects to go "up and over", 
like this:


static int do_dentry_open(struct file *f,
  struct inode *inode,
  int (*open)(struct inode *, struct file *))
{
...

file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);

...and it crashes on that line fairly early in bootup.

Not sure what to suggest for this patch, but wanted to get this report out at 
least.
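
If it helps the discussion: one possible band-aid is to fall back to the
inode's own mapping when the iomem one isn't available. Completely untested
sketch, in open_port():

	struct address_space *mapping = iomem_get_mapping();

	/* NULL without CONFIG_IO_STRICT_DEVMEM; keep f_mapping->host
	 * valid so do_dentry_open() and friends don't explode */
	filp->f_mapping = mapping ? mapping : inode->i_mapping;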

thanks,
--
John Hubbard
NVIDIA

  
  	return 0;

  }
@@ -1023,48 +984,6 @@ static char *mem_devnode(struct device *dev, umode_t *mode)
  
  static struct class *mem_class;
  
-static int devmem_fs_init_fs_context(struct fs_context *fc)

-{
-   return init_pseudo(fc, DEVMEM_MAGIC) ? 0 : -ENOMEM;
-}
-
-static struct file_system_type devmem_fs_type = {
-   .name   = "devmem",
-   .owner  = THIS_MODULE,
-   .init_fs_context = devmem_fs_init_fs_context,
-   .kill_sb= kill_anon_super,
-};
-
-static int devmem_init_inode(void)
-{
-   static struct vfsmount *devmem_vfs_mount;
-   static int devmem_fs_cnt;
-   struct inode *inode;
-   int rc;
-
-   rc = simple_pin_fs(&devmem_fs_type, &devmem_vfs_mount, &devmem_fs_cnt);
-   if (rc < 0) {
-   pr_err("Cannot mount /dev/mem pseudo filesystem: %d\n", rc);
-   return rc;
-   }
-
-   inode = alloc_anon_in

Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM

2020-10-30 Thread John Hubbard

On 10/30/20 3:08 AM, Daniel Vetter wrote:

This is used by media/videbuf2 for persistent dma mappings, not just
for a single dma operation and then freed again, so needs
FOLL_LONGTERM.

Unfortunately current pup_locked doesn't support FOLL_LONGTERM due to
locking issues. Rework the code to pull the pup path out from the
mmap_sem critical section as suggested by Jason.

By relying entirely on the vma checks in pin_user_pages and follow_pfn


There are vma checks in pin_user_pages(), but this patch changes things
to call pin_user_pages_fast(). And that does not have the vma checks.
More below about this:


(for vm_flags and vma_is_fsdax) we can also streamline the code a lot.

Signed-off-by: Daniel Vetter 
Cc: Jason Gunthorpe 
Cc: Pawel Osciak 
Cc: Marek Szyprowski 
Cc: Kyungmin Park 
Cc: Tomasz Figa 
Cc: Mauro Carvalho Chehab 
Cc: Andrew Morton 
Cc: John Hubbard 
Cc: Jérôme Glisse 
Cc: Jan Kara 
Cc: Dan Williams 
Cc: linux...@kvack.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-samsung-...@vger.kernel.org
Cc: linux-me...@vger.kernel.org
Signed-off-by: Daniel Vetter 
--
v2: Streamline the code and further simplify the loop checks (Jason)

v5: Review from Tomasz:
- fix page counting for the follow_pfn case by resetting ret
- drop gup_flags paramater, now unused
---
  .../media/common/videobuf2/videobuf2-memops.c |  3 +-
  include/linux/mm.h|  2 +-
  mm/frame_vector.c | 53 ++-
  3 files changed, 19 insertions(+), 39 deletions(-)

diff --git a/drivers/media/common/videobuf2/videobuf2-memops.c 
b/drivers/media/common/videobuf2/videobuf2-memops.c
index 6e9e05153f4e..9dd6c27162f4 100644
--- a/drivers/media/common/videobuf2/videobuf2-memops.c
+++ b/drivers/media/common/videobuf2/videobuf2-memops.c
@@ -40,7 +40,6 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
unsigned long first, last;
unsigned long nr;
struct frame_vector *vec;
-   unsigned int flags = FOLL_FORCE | FOLL_WRITE;
  
  	first = start >> PAGE_SHIFT;

last = (start + length - 1) >> PAGE_SHIFT;
@@ -48,7 +47,7 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
vec = frame_vector_create(nr);
if (!vec)
return ERR_PTR(-ENOMEM);
-   ret = get_vaddr_frames(start & PAGE_MASK, nr, flags, vec);
+   ret = get_vaddr_frames(start & PAGE_MASK, nr, vec);
if (ret < 0)
goto out_destroy;
/* We accept only complete set of PFNs */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef360fe70aaf..d6b8e30dce2e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1765,7 +1765,7 @@ struct frame_vector {
  struct frame_vector *frame_vector_create(unsigned int nr_frames);
  void frame_vector_destroy(struct frame_vector *vec);
  int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
-unsigned int gup_flags, struct frame_vector *vec);
+struct frame_vector *vec);
  void put_vaddr_frames(struct frame_vector *vec);
  int frame_vector_to_pages(struct frame_vector *vec);
  void frame_vector_to_pfns(struct frame_vector *vec);
diff --git a/mm/frame_vector.c b/mm/frame_vector.c
index 10f82d5643b6..f8c34b895c76 100644
--- a/mm/frame_vector.c
+++ b/mm/frame_vector.c
@@ -32,13 +32,12 @@
   * This function takes care of grabbing mmap_lock as necessary.
   */
  int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
-unsigned int gup_flags, struct frame_vector *vec)
+struct frame_vector *vec)
  {
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
int ret = 0;
int err;
-   int locked;
  
  	if (nr_frames == 0)

return 0;
@@ -48,40 +47,26 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
  
  	start = untagged_addr(start);
  
-	mmap_read_lock(mm);

-   locked = 1;
-   vma = find_vma_intersection(mm, start, start + 1);
-   if (!vma) {
-   ret = -EFAULT;
-   goto out;
-   }
-
-   /*
-* While get_vaddr_frames() could be used for transient (kernel
-* controlled lifetime) pinning of memory pages all current
-* users establish long term (userspace controlled lifetime)
-* page pinning. Treat get_vaddr_frames() like
-* get_user_pages_longterm() and disallow it for filesystem-dax
-* mappings.
-*/
-   if (vma_is_fsdax(vma)) {
-   ret = -EOPNOTSUPP;
-   goto out;
-   }
-
-   if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {


By removing this check from this location, and changing from
pin_user_pages_locked() to pin_user_pages_fast(), I *think* we end up
losing the check entirely. Is that intended? If so it could use a comment
somewhere to explain why.

thanks,
--
John Hubbard
NVIDIA


+   ret = pin

Re: [PATCH v5 0/5] RDMA: Add dma-buf support

2020-10-20 Thread John Hubbard

On 10/15/20 3:02 PM, Jianxin Xiong wrote:

This is the fifth version of the patch set. Changelog:



Hi,

A minor point, but if you can tweak your email sending setup, it would be nice.
Specifically, make follow-up patches a reply to the first item. That's a list
convention, and git format-patch + git send-email *.patch is normally 
sufficient to
make that happen, unless you override it by doing something like sending each
patch separately...which is my first suspicion as to how this happened.

These patches are difficult to link to, because they don't follow the convention
of patches 1-5 being in-reply-to patch 0. So if we want to ask people outside
of this list to take a peek (I was about to), we have to go collect 5 or 6
different lore.kernel.org URLs, one for each patch...

Take a look on lore and you can see the problem. Here's patch 0, and there is
no way from there to find the remaining patches:

   
https://lore.kernel.org/dri-devel/1602799340-138152-1-git-send-email-jianxin.xi...@intel.com/


thanks,
--
John Hubbard
NVIDIA


v5:
* Fix a few warnings reported by kernel test robot:
 - no previous prototype for function 'ib_umem_dmabuf_release'
 - no previous prototype for function 'ib_umem_dmabuf_map_pages'
 - comparison of distinct pointer types in 'check_add_overflow'
* Add comment for the wait between getting the dma-buf sg table and
   updating the NIC page table

v4: https://www.spinics.net/lists/linux-rdma/msg96767.html
* Add a new ib_device method reg_user_mr_dmabuf() instead of expanding
   the existing method reg_user_mr()
* Use a separate code flow for dma-buf instead of adding special cases
   to the ODP memory region code path
* In invalidation callback, new mapping is updated as whole using work
   queue instead of being updated in page granularity in the page fault
   handler
* Use dma_resv_get_excl() and dma_fence_wait() to ensure the content of
   the pages have been moved to the new location before the new mapping
   is programmed into the NIC
* Add code to the ODP page fault handler to check the mapping status
* The new access flag added in v3 is removed.
* The checking for on-demand paging support in the new uverbs command
   is removed because it is implied by implementing the new ib_device
   method
* Clarify that dma-buf sg lists are page aligned

v3: https://www.spinics.net/lists/linux-rdma/msg96330.html
* Use dma_buf_dynamic_attach() instead of dma_buf_attach()
* Use on-demand paging mechanism to avoid pinning the GPU memory
* Instead of adding a new parameter to the device method for memory
   registration, pass all the attributes including the file descriptor
   as a structure
* Define a new access flag for dma-buf based memory region
* Check for on-demand paging support in the new uverbs command

v2: https://www.spinics.net/lists/linux-rdma/msg93643.html
* The Kconfig option is removed. There is no dependence issue since
   dma-buf driver is always enabled.
* The declaration of new data structure and functions is reorganized to
   minimize the visibility of the changes.
* The new uverbs command now goes through ioctl() instead of write().
* The rereg functionality is removed.
* Instead of adding new device method for dma-buf specific registration,
   existing method is extended to accept an extra parameter.
* The correct function is now used for address range checking.

v1: https://www.spinics.net/lists/linux-rdma/msg90720.html
* The initial patch set
* Implement core functions for importing and mapping dma-buf
* Use dma-buf static attach interface
* Add two ib_device methods reg_user_mr_fd() and rereg_user_mr_fd()
* Add two uverbs commands via the write() interface
* Add Kconfig option
* Add dma-buf support to mlx5 device

When enabled, an RDMA capable NIC can perform peer-to-peer transactions
over PCIe to access the local memory located on another device. This can
often lead to better performance than using a system memory buffer for
RDMA and copying data between the buffer and device memory.

Current kernel RDMA stack uses get_user_pages() to pin the physical
pages backing the user buffer and uses dma_map_sg_attrs() to get the
dma addresses for memory access. This usually doesn't work for peer
device memory due to the lack of associated page structures.

Several mechanisms exist today to facilitate device memory access.

ZONE_DEVICE is a new zone for device memory in the memory management
subsystem. It allows pages from device memory being described with
specialized page structures, but what can be done with these page
structures may be different from system memory. ZONE_DEVICE is further
specialized into multiple memory types, such as one type for PCI
p2pmem/p2pdma and one type for HMM.

PCI p2pmem/p2pdma uses ZONE_DEVICE to represent device memory residing
in a PCI BAR and provides a set of calls to publish, discover, allocate,
and map such memory for peer-to-peer transactions. One feature of the
API is that the buffer is allocated by the side that does

Re: [PATCH v2 05/17] mm/frame-vector: Use FOLL_LONGTERM

2020-10-16 Thread John Hubbard

On 10/9/20 12:59 AM, Daniel Vetter wrote:
...

@@ -48,40 +47,25 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
  
  	start = untagged_addr(start);
  
-	mmap_read_lock(mm);

-   locked = 1;
-   vma = find_vma_intersection(mm, start, start + 1);
-   if (!vma) {
-   ret = -EFAULT;
-   goto out;
-   }
-
-   /*
-* While get_vaddr_frames() could be used for transient (kernel
-* controlled lifetime) pinning of memory pages all current
-* users establish long term (userspace controlled lifetime)
-* page pinning. Treat get_vaddr_frames() like
-* get_user_pages_longterm() and disallow it for filesystem-dax
-* mappings.
-*/
-   if (vma_is_fsdax(vma)) {
-   ret = -EOPNOTSUPP;
-   goto out;
-   }
-
-   if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
+   ret = pin_user_pages_fast(start, nr_frames,
+ FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
+ (struct page **)(vec->ptrs));
+   if (ret > 0) {


None of the callers that we have today will accept anything less than
ret == nr_frames. And the whole partially pinned region idea turns out
to be just not useful for almost everyone, from what I recall of the gup/pup
call sites. So I wonder if we should just have get_vaddr_frames do the
cleanup here and return -EFAULT, if ret != nr_frames ?
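
In other words, something like this at the end of get_vaddr_frames()
(untested sketch):

	ret = pin_user_pages_fast(start, nr_frames,
				  FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
				  (struct page **)(vec->ptrs));
	if (ret > 0 && ret != nr_frames) {
		/* no current caller can use a partial pin: undo and fail */
		unpin_user_pages((struct page **)(vec->ptrs), ret);
		ret = -EFAULT;
	}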

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v2 03/17] misc/habana: Stop using frame_vector helpers

2020-10-16 Thread John Hubbard

On 10/9/20 12:59 AM, Daniel Vetter wrote:

All we need are a pages array, pin_user_pages_fast can give us that
directly. Plus this avoids the entire raw pfn side of get_vaddr_frames.

Signed-off-by: Daniel Vetter 
Cc: Jason Gunthorpe 
Cc: Andrew Morton 
Cc: John Hubbard 
Cc: Jérôme Glisse 
Cc: Jan Kara 
Cc: Dan Williams 
Cc: linux...@kvack.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-samsung-...@vger.kernel.org
Cc: linux-me...@vger.kernel.org
Cc: Oded Gabbay 
Cc: Omer Shpigelman 
Cc: Ofir Bitton 
Cc: Tomer Tayar 
Cc: Moti Haimovski 
Cc: Daniel Vetter 
Cc: Greg Kroah-Hartman 
Cc: Pawel Piskorski 
--
v2: Use unpin_user_pages_dirty_lock (John)
---
  drivers/misc/habanalabs/Kconfig |  1 -
  drivers/misc/habanalabs/common/habanalabs.h |  3 +-
  drivers/misc/habanalabs/common/memory.c | 49 -
  3 files changed, 20 insertions(+), 33 deletions(-)


Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA
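
In a nutshell, the conversion pattern in the quoted hunks below is
(condensed from the patch itself):

	userptr->pages = kvmalloc_array(npages, sizeof(*userptr->pages),
					GFP_KERNEL);
	rc = pin_user_pages_fast(start, npages, FOLL_FORCE | FOLL_WRITE,
				 userptr->pages);
	/* ... and on teardown: */
	unpin_user_pages(userptr->pages, npages);
	kvfree(userptr->pages);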



diff --git a/drivers/misc/habanalabs/Kconfig b/drivers/misc/habanalabs/Kconfig
index 8eb5d38c618e..2f04187f7167 100644
--- a/drivers/misc/habanalabs/Kconfig
+++ b/drivers/misc/habanalabs/Kconfig
@@ -6,7 +6,6 @@
  config HABANA_AI
tristate "HabanaAI accelerators (habanalabs)"
depends on PCI && HAS_IOMEM
-   select FRAME_VECTOR
select DMA_SHARED_BUFFER
select GENERIC_ALLOCATOR
select HWMON
diff --git a/drivers/misc/habanalabs/common/habanalabs.h 
b/drivers/misc/habanalabs/common/habanalabs.h
index edbd627b29d2..c1b3ad613b15 100644
--- a/drivers/misc/habanalabs/common/habanalabs.h
+++ b/drivers/misc/habanalabs/common/habanalabs.h
@@ -881,7 +881,8 @@ struct hl_ctx_mgr {
  struct hl_userptr {
enum vm_type_t  vm_type; /* must be first */
struct list_headjob_node;
-   struct frame_vector *vec;
+   struct page **pages;
+   unsigned intnpages;
struct sg_table *sgt;
enum dma_data_direction dir;
struct list_headdebugfs_list;
diff --git a/drivers/misc/habanalabs/common/memory.c 
b/drivers/misc/habanalabs/common/memory.c
index 5ff4688683fd..327b64479f97 100644
--- a/drivers/misc/habanalabs/common/memory.c
+++ b/drivers/misc/habanalabs/common/memory.c
@@ -1281,45 +1281,41 @@ static int get_user_memory(struct hl_device *hdev, u64 addr, u64 size,
return -EFAULT;
}
  
-	userptr->vec = frame_vector_create(npages);

-   if (!userptr->vec) {
+   userptr->pages = kvmalloc_array(npages, sizeof(*userptr->pages),
+   GFP_KERNEL);
+   if (!userptr->pages) {
dev_err(hdev->dev, "Failed to create frame vector\n");
return -ENOMEM;
}
  
-	rc = get_vaddr_frames(start, npages, FOLL_FORCE | FOLL_WRITE,

-   userptr->vec);
+   rc = pin_user_pages_fast(start, npages, FOLL_FORCE | FOLL_WRITE,
+userptr->pages);
  
  	if (rc != npages) {

dev_err(hdev->dev,
"Failed to map host memory, user ptr probably wrong\n");
if (rc < 0)
-   goto destroy_framevec;
+   goto destroy_pages;
+   npages = rc;
rc = -EFAULT;
-   goto put_framevec;
-   }
-
-   if (frame_vector_to_pages(userptr->vec) < 0) {
-   dev_err(hdev->dev,
-   "Failed to translate frame vector to pages\n");
-   rc = -EFAULT;
-   goto put_framevec;
+   goto put_pages;
}
+   userptr->npages = npages;
  
  	rc = sg_alloc_table_from_pages(userptr->sgt,

-   frame_vector_pages(userptr->vec),
-   npages, offset, size, GFP_ATOMIC);
+  userptr->pages,
+  npages, offset, size, GFP_ATOMIC);
if (rc < 0) {
dev_err(hdev->dev, "failed to create SG table from pages\n");
-   goto put_framevec;
+   goto put_pages;
}
  
  	return 0;
  
-put_framevec:

-   put_vaddr_frames(userptr->vec);
-destroy_framevec:
-   frame_vector_destroy(userptr->vec);
+put_pages:
+   unpin_user_pages(userptr->pages, npages);
+destroy_pages:
+   kvfree(userptr->pages);
return rc;
  }
  
@@ -1405,8 +1401,6 @@ int hl_pin_host_memory(struct hl_device *hdev, u64 addr, u64 size,

   */
  void hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr)
  {
-   struct page **pages;
-
hl_debugfs_remove_userptr(hdev, userptr);
  
  	if (userptr->dma_mapped)

@@ -1414,15 +1408,8 @@ void hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr)
   

Re: [PATCH v2 01/17] drm/exynos: Stop using frame_vector helpers

2020-10-16 Thread John Hubbard

On 10/9/20 12:59 AM, Daniel Vetter wrote:

All we need are a pages array, pin_user_pages_fast can give us that
directly. Plus this avoids the entire raw pfn side of get_vaddr_frames.

Signed-off-by: Daniel Vetter 
Cc: Jason Gunthorpe 
Cc: Inki Dae 
Cc: Joonyoung Shim 
Cc: Seung-Woo Kim 
Cc: Kyungmin Park 
Cc: Kukjin Kim 
Cc: Krzysztof Kozlowski 
Cc: Andrew Morton 
Cc: John Hubbard 
Cc: Jérôme Glisse 
Cc: Jan Kara 
Cc: Dan Williams 
Cc: linux...@kvack.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-samsung-...@vger.kernel.org
Cc: linux-me...@vger.kernel.org
--
v2: Use unpin_user_pages_dirty_lock (John)
---
  drivers/gpu/drm/exynos/Kconfig  |  1 -
  drivers/gpu/drm/exynos/exynos_drm_g2d.c | 47 +++--
  2 files changed, 20 insertions(+), 28 deletions(-)



Looks good.

Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA


diff --git a/drivers/gpu/drm/exynos/Kconfig b/drivers/gpu/drm/exynos/Kconfig
index 6417f374b923..43257ef3c09d 100644
--- a/drivers/gpu/drm/exynos/Kconfig
+++ b/drivers/gpu/drm/exynos/Kconfig
@@ -88,7 +88,6 @@ comment "Sub-drivers"
  config DRM_EXYNOS_G2D
bool "G2D"
depends on VIDEO_SAMSUNG_S5P_G2D=n || COMPILE_TEST
-   select FRAME_VECTOR
help
  Choose this option if you want to use Exynos G2D for DRM.
  
diff --git a/drivers/gpu/drm/exynos/exynos_drm_g2d.c b/drivers/gpu/drm/exynos/exynos_drm_g2d.c

index 967a5cdc120e..ecede41af9b9 100644
--- a/drivers/gpu/drm/exynos/exynos_drm_g2d.c
+++ b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
@@ -205,7 +205,8 @@ struct g2d_cmdlist_userptr {
dma_addr_t  dma_addr;
unsigned long   userptr;
unsigned long   size;
-   struct frame_vector *vec;
+   struct page **pages;
+   unsigned intnpages;
struct sg_table *sgt;
atomic_trefcount;
boolin_pool;
@@ -378,7 +379,6 @@ static void g2d_userptr_put_dma_addr(struct g2d_data *g2d, bool force)
  {
struct g2d_cmdlist_userptr *g2d_userptr = obj;
-   struct page **pages;
  
  	if (!obj)

return;
@@ -398,15 +398,9 @@ static void g2d_userptr_put_dma_addr(struct g2d_data *g2d,
dma_unmap_sgtable(to_dma_dev(g2d->drm_dev), g2d_userptr->sgt,
  DMA_BIDIRECTIONAL, 0);
  
-	pages = frame_vector_pages(g2d_userptr->vec);

-   if (!IS_ERR(pages)) {
-   int i;
-
-   for (i = 0; i < frame_vector_count(g2d_userptr->vec); i++)
-   set_page_dirty_lock(pages[i]);
-   }
-   put_vaddr_frames(g2d_userptr->vec);
-   frame_vector_destroy(g2d_userptr->vec);
+   unpin_user_pages_dirty_lock(g2d_userptr->pages, g2d_userptr->npages,
+   true);
+   kvfree(g2d_userptr->pages);
  
  	if (!g2d_userptr->out_of_list)

list_del_init(&g2d_userptr->list);
@@ -474,35 +468,34 @@ static dma_addr_t *g2d_userptr_get_dma_addr(struct g2d_data *g2d,
offset = userptr & ~PAGE_MASK;
end = PAGE_ALIGN(userptr + size);
npages = (end - start) >> PAGE_SHIFT;
-   g2d_userptr->vec = frame_vector_create(npages);
-   if (!g2d_userptr->vec) {
+   g2d_userptr->pages = kvmalloc_array(npages, sizeof(*g2d_userptr->pages),
+   GFP_KERNEL);
+   if (!g2d_userptr->pages) {
ret = -ENOMEM;
goto err_free;
}
  
-	ret = get_vaddr_frames(start, npages, FOLL_FORCE | FOLL_WRITE,

-   g2d_userptr->vec);
+   ret = pin_user_pages_fast(start, npages, FOLL_FORCE | FOLL_WRITE,
+ g2d_userptr->pages);
if (ret != npages) {
DRM_DEV_ERROR(g2d->dev,
  "failed to get user pages from userptr.\n");
if (ret < 0)
-   goto err_destroy_framevec;
-   ret = -EFAULT;
-   goto err_put_framevec;
-   }
-   if (frame_vector_to_pages(g2d_userptr->vec) < 0) {
+   goto err_destroy_pages;
+   npages = ret;
ret = -EFAULT;
-   goto err_put_framevec;
+   goto err_unpin_pages;
}
+   g2d_userptr->npages = npages;
  
  	sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);

if (!sgt) {
ret = -ENOMEM;
-   goto err_put_framevec;
+   goto err_unpin_pages;
}
  
  	ret = sg_alloc_table_from_pages(sgt,

-   frame_vector_pages(g2d_userptr->vec),
+   g2d_userptr->pages,
npages, offset, size, GFP_KERNEL);
if (ret < 0) {
DRM_DEV_ERR

Re: [PATCH RFC PKS/PMEM 57/58] nvdimm/pmem: Stray access protection for pmem->virt_addr

2020-10-09 Thread John Hubbard

On 10/9/20 12:50 PM, ira.we...@intel.com wrote:

From: Ira Weiny 

The pmem driver uses a cached virtual address to access its memory
directly.  Because the nvdimm driver is well aware of the special
protections it has mapped memory with, we call dev_access_[en|dis]able()
around the direct pmem->virt_addr (pmem_addr) usage instead of the
unnecessary overhead of trying to get a page to kmap.

Signed-off-by: Ira Weiny 
---
  drivers/nvdimm/pmem.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index fab29b514372..e4dc1ae990fc 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -148,7 +148,9 @@ static blk_status_t pmem_do_read(struct pmem_device *pmem,
if (unlikely(is_bad_pmem(>bb, sector, len)))
return BLK_STS_IOERR;
  
+	dev_access_enable(false);

rc = read_pmem(page, page_off, pmem_addr, len);
+   dev_access_disable(false);


Hi Ira!

The APIs should be tweaked to use a symbol (GLOBAL, PER_THREAD), instead of
true/false. Try reading the above and you'll see that it sounds like it's
doing the opposite of what it is ("enable_this(false)" sounds like a clumsy
API design to *disable*, right?). And there is no hint about the scope.

And it *could* be so much more readable like this:

dev_access_enable(DEV_ACCESS_THIS_THREAD);



thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 1/4] mm: introduce vma_set_file function v2

2020-10-09 Thread John Hubbard

On 10/9/20 12:33 AM, Christian König wrote:

Am 08.10.20 um 23:49 schrieb John Hubbard:

On 10/8/20 4:23 AM, Christian König wrote:
...


diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.c 
b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
index 3d69e51f3e4d..c9d5f1a38af3 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_mman.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
@@ -893,8 +893,8 @@ int i915_gem_mmap(struct file *filp, struct vm_area_struct *vma)
   * requires avoiding extraneous references to their filp, hence why
   * we prefer to use an anonymous file for their mmaps.
   */
-    fput(vma->vm_file);
-    vma->vm_file = anon;
+    vma_set_file(vma, anon);
+    fput(anon);


That's one fput() too many, isn't it?


No, the other cases were replacing the vm_file with something pre-allocated and also grabbed a new 
reference.


But this case here uses the freshly allocated anon file and so vma_set_file() grabs another extra 
reference which we need to drop.


The alternative is to just keep it as it is. Opinions?



I think just a small comment for these cases, is probably about right.
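
For anyone reading along, vma_set_file() presumably boils down to something
like this (a sketch based on the discussion, not the actual mm/mmap.c hunk):

	void vma_set_file(struct vm_area_struct *vma, struct file *file)
	{
		/* take a reference on the new file before dropping the
		 * old one, so the swap is safe in all orderings */
		get_file(file);
		swap(vma->vm_file, file);
		fput(file);
	}

Which is exactly why a caller handing over a file it just created (the i915
anon file, or ashmem's asma->file after its own get_file()) still owns one
reference of its own and has to fput() it afterwards.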


...


diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index 10b4be1f3e78..a51dc089896e 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -450,9 +450,8 @@ static int ashmem_mmap(struct file *file, struct vm_area_struct *vma)
  vma_set_anonymous(vma);
  }
  -    if (vma->vm_file)
-    fput(vma->vm_file);
-    vma->vm_file = asma->file;
+    vma_set_file(vma, asma->file);
+    fput(asma->file);


Same here: that fput() seems wrong, as it was already done within 
vma_set_file().


No, that case is correct as well. The Android code here has the matching get_file() a few lines up, 
see the surrounding code.


I didn't want to replace that, since it does some strange error handling here, so the result is
that we need to drop the extra reference again.


We could also keep it like it is or maybe better put a TODO comment on it.



Yeah, I think a comment is a good way to go.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 1/4] mm: introduce vma_set_file function v2

2020-10-08 Thread John Hubbard

On 10/8/20 4:23 AM, Christian König wrote:

Add the new vma_set_file() function to allow changing
vma->vm_file with the necessary refcount dance.

v2: add more users of this.

Signed-off-by: Christian König 
---
  drivers/dma-buf/dma-buf.c  | 16 +---
  drivers/gpu/drm/etnaviv/etnaviv_gem.c  |  4 +---
  drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c |  3 +--
  drivers/gpu/drm/i915/gem/i915_gem_mman.c   |  4 ++--
  drivers/gpu/drm/msm/msm_gem.c  |  4 +---
  drivers/gpu/drm/omapdrm/omap_gem.c |  3 +--
  drivers/gpu/drm/vgem/vgem_drv.c|  3 +--
  drivers/staging/android/ashmem.c   |  5 ++---
  include/linux/mm.h |  2 ++
  mm/mmap.c  | 16 
  10 files changed, 32 insertions(+), 28 deletions(-)


Looks like a nice cleanup. Two comments below.

...


diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.c 
b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
index 3d69e51f3e4d..c9d5f1a38af3 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_mman.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
@@ -893,8 +893,8 @@ int i915_gem_mmap(struct file *filp, struct vm_area_struct *vma)
 * requires avoiding extraneous references to their filp, hence why
 * we prefer to use an anonymous file for their mmaps.
 */
-   fput(vma->vm_file);
-   vma->vm_file = anon;
+   vma_set_file(vma, anon);
+   fput(anon);


That's one fput() too many, isn't it?

...


diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index 10b4be1f3e78..a51dc089896e 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -450,9 +450,8 @@ static int ashmem_mmap(struct file *file, struct vm_area_struct *vma)
vma_set_anonymous(vma);
}
  
-	if (vma->vm_file)

-   fput(vma->vm_file);
-   vma->vm_file = asma->file;
+   vma_set_file(vma, asma->file);
+   fput(asma->file);


Same here: that fput() seems wrong, as it was already done within 
vma_set_file().


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 07/13] mm: close race in generic_access_phys

2020-10-07 Thread John Hubbard

On 10/7/20 9:44 AM, Daniel Vetter wrote:

Way back it was a reasonable assumption that iomem mappings never
change the pfn range they point at. But this has changed:

- gpu drivers dynamically manage their memory nowadays, invalidating
   ptes with unmap_mapping_range when buffers get moved

- contiguous dma allocations have moved from dedicated carvetouts to


s/carvetouts/carveouts/


   cma regions. This means if we miss the unmap the pfn might contain
   pagecache or anon memory (well anything allocated with GFP_MOVEABLE)

- even /dev/mem now invalidates mappings when the kernel requests that
   iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87
   ("/dev/mem: Revoke mappings when a driver claims the region")


Thanks for putting these references into the log, it's very helpful.
...

diff --git a/mm/memory.c b/mm/memory.c
index fcfc4ca36eba..8d467e23b44e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4873,28 +4873,68 @@ int follow_phys(struct vm_area_struct *vma,
return ret;
  }
  
+/**

+ * generic_access_phys - generic implementation for iomem mmap access
+ * @vma: the vma to access
+ * @addr: userspace addres, not relative offset within @vma
+ * @buf: buffer to read/write
+ * @len: length of transfer
+ * @write: set to FOLL_WRITE when writing, otherwise reading
+ *
+ * This is a generic implementation for _operations_struct.access for an
+ * iomem mapping. This callback is used by access_process_vm() when the @vma is
+ * not page based.
+ */
  int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
void *buf, int len, int write)
  {
resource_size_t phys_addr;
unsigned long prot = 0;
void __iomem *maddr;
+   pte_t *ptep, pte;
+   spinlock_t *ptl;
int offset = addr & (PAGE_SIZE-1);
+   int ret = -EINVAL;
+
+   if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
+   return -EINVAL;
+
+retry:
+   if (follow_pte(vma->vm_mm, addr, &ptep, &ptl))
+   return -EINVAL;
+   pte = *ptep;
+   pte_unmap_unlock(ptep, ptl);
  
-	if (follow_phys(vma, addr, write, &prot, &phys_addr))

+   prot = pgprot_val(pte_pgprot(pte));
+   phys_addr = (resource_size_t)pte_pfn(pte) << PAGE_SHIFT;
+
+   if ((write & FOLL_WRITE) && !pte_write(pte))
return -EINVAL;
  
  	maddr = ioremap_prot(phys_addr, PAGE_ALIGN(len + offset), prot);

if (!maddr)
return -ENOMEM;
  
+	if (follow_pte(vma->vm_mm, addr, &ptep, &ptl))

+   goto out_unmap;
+
+   if (pte_same(pte, *ptep)) {



The ioremap area is something I'm sorta new to, so a newbie question:
is it possible for the same pte to already be there, ever? If so, we'd
be stuck in an infinite loop here. I'm sure that's not the case, but
it's not yet obvious to me why it's impossible. Resource reservations
maybe?


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 06/13] media: videobuf2: Move frame_vector into media subsystem

2020-10-07 Thread John Hubbard

On 10/7/20 9:44 AM, Daniel Vetter wrote:

It's the only user. This also garbage collects the CONFIG_FRAME_VECTOR
symbol from all over the tree (well just one place, somehow omap media
driver still had this in its Kconfig, despite not using it).

Signed-off-by: Daniel Vetter 
Cc: Jason Gunthorpe 
Cc: Pawel Osciak 
Cc: Marek Szyprowski 
Cc: Kyungmin Park 
Cc: Tomasz Figa 
Cc: Mauro Carvalho Chehab 
Cc: Andrew Morton 
Cc: John Hubbard 
Cc: Jérôme Glisse 
Cc: Jan Kara 
Cc: Dan Williams 
Cc: linux...@kvack.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-samsung-...@vger.kernel.org
Cc: linux-me...@vger.kernel.org
Cc: Daniel Vetter 
---


Failed to spot any problems here. :)

Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA


  drivers/media/common/videobuf2/Kconfig|  1 -
  drivers/media/common/videobuf2/Makefile   |  1 +
  .../media/common/videobuf2}/frame_vector.c|  2 +
  drivers/media/platform/omap/Kconfig   |  1 -
  include/linux/mm.h| 42 ---
  include/media/videobuf2-core.h| 42 +++
  mm/Kconfig|  3 --
  mm/Makefile   |  1 -
  8 files changed, 45 insertions(+), 48 deletions(-)
  rename {mm => drivers/media/common/videobuf2}/frame_vector.c (99%)

diff --git a/drivers/media/common/videobuf2/Kconfig 
b/drivers/media/common/videobuf2/Kconfig
index edbc99ebba87..d2223a12c95f 100644
--- a/drivers/media/common/videobuf2/Kconfig
+++ b/drivers/media/common/videobuf2/Kconfig
@@ -9,7 +9,6 @@ config VIDEOBUF2_V4L2
  
  config VIDEOBUF2_MEMOPS

tristate
-   select FRAME_VECTOR
  
  config VIDEOBUF2_DMA_CONTIG

tristate
diff --git a/drivers/media/common/videobuf2/Makefile 
b/drivers/media/common/videobuf2/Makefile
index 77bebe8b202f..54306f8d096c 100644
--- a/drivers/media/common/videobuf2/Makefile
+++ b/drivers/media/common/videobuf2/Makefile
@@ -1,5 +1,6 @@
  # SPDX-License-Identifier: GPL-2.0
  videobuf2-common-objs := videobuf2-core.o
+videobuf2-common-objs += frame_vector.o
  
  ifeq ($(CONFIG_TRACEPOINTS),y)

videobuf2-common-objs += vb2-trace.o
diff --git a/mm/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c
similarity index 99%
rename from mm/frame_vector.c
rename to drivers/media/common/videobuf2/frame_vector.c
index 39db520a51dc..b95f4f371681 100644
--- a/mm/frame_vector.c
+++ b/drivers/media/common/videobuf2/frame_vector.c
@@ -8,6 +8,8 @@
  #include 
  #include 
  
+#include 

+
  /**
   * get_vaddr_frames() - map virtual addresses to pfns
   * @start: starting user address
diff --git a/drivers/media/platform/omap/Kconfig 
b/drivers/media/platform/omap/Kconfig
index f73b5893220d..de16de46c0f4 100644
--- a/drivers/media/platform/omap/Kconfig
+++ b/drivers/media/platform/omap/Kconfig
@@ -12,6 +12,5 @@ config VIDEO_OMAP2_VOUT
depends on VIDEO_V4L2
select VIDEOBUF2_DMA_CONTIG
select OMAP2_VRFB if ARCH_OMAP2 || ARCH_OMAP3
-   select FRAME_VECTOR
help
  V4L2 Display driver support for OMAP2/3 based boards.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 16b799a0522c..acd60fbf1a5a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1743,48 +1743,6 @@ int account_locked_vm(struct mm_struct *mm, unsigned 
long pages, bool inc);
  int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
struct task_struct *task, bool bypass_rlim);
  
-/* Container for pinned pfns / pages */

-struct frame_vector {
-   unsigned int nr_allocated;  /* Number of frames we have space for */
-   unsigned int nr_frames; /* Number of frames stored in ptrs array */
-   bool got_ref;   /* Did we pin pages by getting page ref? */
-   bool is_pfns;   /* Does array contain pages or pfns? */
-   void *ptrs[];   /* Array of pinned pfns / pages. Use
-* pfns_vector_pages() or pfns_vector_pfns()
-* for access */
-};
-
-struct frame_vector *frame_vector_create(unsigned int nr_frames);
-void frame_vector_destroy(struct frame_vector *vec);
-int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
-unsigned int gup_flags, struct frame_vector *vec);
-void put_vaddr_frames(struct frame_vector *vec);
-int frame_vector_to_pages(struct frame_vector *vec);
-void frame_vector_to_pfns(struct frame_vector *vec);
-
-static inline unsigned int frame_vector_count(struct frame_vector *vec)
-{
-   return vec->nr_frames;
-}
-
-static inline struct page **frame_vector_pages(struct frame_vector *vec)
-{
-   if (vec->is_pfns) {
-   int err = frame_vector_to_pages(vec);
-
-   if (err)
-   return ERR_PTR(err);
-   }
-   return (struct page **)(vec->ptrs);
-}
-
-static inline unsigned long *frame_vector_pfns(struct frame_vector *vec)
-{
-   

Re: [PATCH 01/13] drm/exynos: Stop using frame_vector helpers

2020-10-07 Thread John Hubbard

On 10/7/20 2:32 PM, Daniel Vetter wrote:

On Wed, Oct 7, 2020 at 10:33 PM John Hubbard  wrote:


On 10/7/20 9:44 AM, Daniel Vetter wrote:

...

@@ -398,15 +399,11 @@ static void g2d_userptr_put_dma_addr(struct g2d_data *g2d,
   dma_unmap_sgtable(to_dma_dev(g2d->drm_dev), g2d_userptr->sgt,
 DMA_BIDIRECTIONAL, 0);

- pages = frame_vector_pages(g2d_userptr->vec);
- if (!IS_ERR(pages)) {
- int i;
+ for (i = 0; i < g2d_userptr->npages; i++)
+ set_page_dirty_lock(g2d_userptr->pages[i]);

- for (i = 0; i < frame_vector_count(g2d_userptr->vec); i++)
- set_page_dirty_lock(pages[i]);
- }
- put_vaddr_frames(g2d_userptr->vec);
- frame_vector_destroy(g2d_userptr->vec);
+ unpin_user_pages(g2d_userptr->pages, g2d_userptr->npages);
+ kvfree(g2d_userptr->pages);


You can avoid writing your own loop, and just simplify the whole thing down to
two lines:

 unpin_user_pages_dirty_lock(g2d_userptr->pages, g2d_userptr->npages,
 true);
 kvfree(g2d_userptr->pages);


Oh nice, this is neat. I'll also roll it out in the habanalabs patch,
that has the same thing. Well almost, it only uses set_page_dirty, not
the _lock variant. But I have no idea whether that matters or not?



It matters. And invariably, call sites that use set_page_dirty() instead
of set_page_dirty_lock() were already wrong.  Which is why I never had to
provide anything like "unpin_user_pages_dirty (not locked)".

Although in habanalabs case, I just reviewed patch 3 and I think they *were*
correctly using set_page_dirty_lock()...
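
To spell out the difference (an illustrative sketch, not code from any
of these patches): set_page_dirty() assumes the caller holds the page
lock, or otherwise keeps page->mapping stable, while the _lock variant
takes the page lock itself, which is the only safe option after DMA
completes:

	unsigned long i;

	for (i = 0; i < npages; i++)
		set_page_dirty_lock(pages[i]);	/* takes lock_page() internally */
	unpin_user_pages(pages, npages);

	/* ...or, equivalently, the consolidated helper: */
	unpin_user_pages_dirty_lock(pages, npages, true);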

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 05/13] mm/frame-vector: Use FOLL_LONGTERM

2020-10-07 Thread John Hubbard

On 10/7/20 9:44 AM, Daniel Vetter wrote:

This is used by media/videobuf2 for persistent dma mappings, not just
for a single dma operation and then freed again, so needs
FOLL_LONGTERM.

Unfortunately current pup_locked doesn't support FOLL_LONGTERM due to
locking issues. Rework the code to pull the pup path out from the
mmap_sem critical section as suggested by Jason.
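
(The shape of that rework, condensed from the diff below: try the fast
path first, before taking mmap_lock at all, and only fall back to the
locked pfn walk for VM_IO/VM_PFNMAP areas:)

	ret = pin_user_pages_fast(start, nr_frames,
				  FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
				  (struct page **)(vec->ptrs));
	if (ret > 0)
		goto out_unlocked;	/* pinned everything, lock never taken */

	mmap_read_lock(mm);
	/* ... VM_IO / VM_PFNMAP pfn walk ... */
	mmap_read_unlock(mm);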

Signed-off-by: Daniel Vetter 
Cc: Jason Gunthorpe 
Cc: Pawel Osciak 
Cc: Marek Szyprowski 
Cc: Kyungmin Park 
Cc: Tomasz Figa 
Cc: Mauro Carvalho Chehab 
Cc: Andrew Morton 
Cc: John Hubbard 
Cc: Jérôme Glisse 
Cc: Jan Kara 
Cc: Dan Williams 
Cc: linux...@kvack.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-samsung-...@vger.kernel.org
Cc: linux-me...@vger.kernel.org
---
  mm/frame_vector.c | 36 +++-
  1 file changed, 11 insertions(+), 25 deletions(-)

diff --git a/mm/frame_vector.c b/mm/frame_vector.c
index 10f82d5643b6..39db520a51dc 100644
--- a/mm/frame_vector.c
+++ b/mm/frame_vector.c
@@ -38,7 +38,6 @@ int get_vaddr_frames(unsigned long start, unsigned int 
nr_frames,
struct vm_area_struct *vma;
int ret = 0;
int err;
-   int locked;
  
  	if (nr_frames == 0)

return 0;
@@ -48,35 +47,22 @@ int get_vaddr_frames(unsigned long start, unsigned int 
nr_frames,
  
  	start = untagged_addr(start);
  
+	ret = pin_user_pages_fast(start, nr_frames,

+ FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
+ (struct page **)(vec->ptrs));
+   if (ret > 0) {
+   vec->got_ref = true;
+   vec->is_pfns = false;
+   goto out_unlocked;
+   }


This part looks good, and changing to _fast is a potential performance 
improvement,
too.


+
mmap_read_lock(mm);
-   locked = 1;
vma = find_vma_intersection(mm, start, start + 1);
if (!vma) {
ret = -EFAULT;
goto out;
}
  
-	/*

-* While get_vaddr_frames() could be used for transient (kernel
-* controlled lifetime) pinning of memory pages all current
-* users establish long term (userspace controlled lifetime)
-* page pinning. Treat get_vaddr_frames() like
-* get_user_pages_longterm() and disallow it for filesystem-dax
-* mappings.
-*/
-   if (vma_is_fsdax(vma)) {
-   ret = -EOPNOTSUPP;
-   goto out;
-   }


Are you sure we don't need to check vma_is_fsdax() anymore?
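
(For context, the premise is that with FOLL_LONGTERM set, the gup core
itself refuses long-term pins of fs-dax VMAs, conceptually along these
lines; a sketch of the idea, not the exact mm/gup.c code:)

	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
		return -EOPNOTSUPP;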


-
-   if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
-   vec->got_ref = true;
-   vec->is_pfns = false;
-   ret = pin_user_pages_locked(start, nr_frames,
-			gup_flags, (struct page **)(vec->ptrs), &locked);
-   goto out;
-   }
-
vec->got_ref = false;
vec->is_pfns = true;
do {
@@ -101,8 +87,8 @@ int get_vaddr_frames(unsigned long start, unsigned int 
nr_frames,
vma = find_vma_intersection(mm, start, start + 1);
} while (vma && vma->vm_flags & (VM_IO | VM_PFNMAP));
  out:
-   if (locked)
-   mmap_read_unlock(mm);
+   mmap_read_unlock(mm);
+out_unlocked:
if (!ret)
ret = -EFAULT;
if (ret > 0)



All of the error handling still looks accurate there.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 04/13] misc/habana: Use FOLL_LONGTERM for userptr

2020-10-07 Thread John Hubbard

On 10/7/20 9:44 AM, Daniel Vetter wrote:

These are persistent, not just for the duration of a dma operation.

Signed-off-by: Daniel Vetter 
Cc: Jason Gunthorpe 
Cc: Andrew Morton 
Cc: John Hubbard 
Cc: Jérôme Glisse 
Cc: Jan Kara 
Cc: Dan Williams 
Cc: linux...@kvack.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-samsung-...@vger.kernel.org
Cc: linux-me...@vger.kernel.org
Cc: Oded Gabbay 
Cc: Omer Shpigelman 
Cc: Ofir Bitton 
Cc: Tomer Tayar 
Cc: Moti Haimovski 
Cc: Daniel Vetter 
Cc: Greg Kroah-Hartman 
Cc: Pawel Piskorski 
---
  drivers/misc/habanalabs/common/memory.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/misc/habanalabs/common/memory.c 
b/drivers/misc/habanalabs/common/memory.c
index ef89cfa2f95a..94bef8faa82a 100644
--- a/drivers/misc/habanalabs/common/memory.c
+++ b/drivers/misc/habanalabs/common/memory.c
@@ -1288,7 +1288,8 @@ static int get_user_memory(struct hl_device *hdev, u64 
addr, u64 size,
return -ENOMEM;
}
  
-	rc = pin_user_pages_fast(start, npages, FOLL_FORCE | FOLL_WRITE,

+   rc = pin_user_pages_fast(start, npages,
+FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
 userptr->pages);
  
  	if (rc != npages) {




Again, from a pin_user_pages_fast() point of view, and not being at all familiar
with the habana driver (but their use of this really does seem clearly 
_LONGTERM!):

Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 02/13] drm/exynos: Use FOLL_LONGTERM for g2d cmdlists

2020-10-07 Thread John Hubbard

On 10/7/20 9:44 AM, Daniel Vetter wrote:

The exynos g2d interface is very unusual, but it looks like the
userptr objects are persistent. Hence they need FOLL_LONGTERM.

Signed-off-by: Daniel Vetter 
Cc: Jason Gunthorpe 
Cc: Inki Dae 
Cc: Joonyoung Shim 
Cc: Seung-Woo Kim 
Cc: Kyungmin Park 
Cc: Kukjin Kim 
Cc: Krzysztof Kozlowski 
Cc: Andrew Morton 
Cc: John Hubbard 
Cc: Jérôme Glisse 
Cc: Jan Kara 
Cc: Dan Williams 
Cc: linux...@kvack.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-samsung-...@vger.kernel.org
Cc: linux-me...@vger.kernel.org
---
  drivers/gpu/drm/exynos/exynos_drm_g2d.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/exynos/exynos_drm_g2d.c 
b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
index c83f6faac9de..514fd000feb1 100644
--- a/drivers/gpu/drm/exynos/exynos_drm_g2d.c
+++ b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
@@ -478,7 +478,8 @@ static dma_addr_t *g2d_userptr_get_dma_addr(struct g2d_data 
*g2d,
goto err_free;
}
  
-	ret = pin_user_pages_fast(start, npages, FOLL_FORCE | FOLL_WRITE,

+   ret = pin_user_pages_fast(start, npages,
+ FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
  g2d_userptr->pages);
if (ret != npages) {
DRM_DEV_ERROR(g2d->dev,



Looks good from a pin_user_pages_fast() point of view. I'm of course not a 
exynos
developer, so we still need a look from one of those, ideally, but:

Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 03/13] misc/habana: Stop using frame_vector helpers

2020-10-07 Thread John Hubbard

On 10/7/20 9:44 AM, Daniel Vetter wrote:
...

@@ -1414,15 +1410,10 @@ void hl_unpin_host_memory(struct hl_device *hdev, 
struct hl_userptr *userptr)
userptr->sgt->nents,
userptr->dir);
  
-	pages = frame_vector_pages(userptr->vec);

-   if (!IS_ERR(pages)) {
-   int i;
-
-   for (i = 0; i < frame_vector_count(userptr->vec); i++)
-   set_page_dirty_lock(pages[i]);
-   }
-   put_vaddr_frames(userptr->vec);
-   frame_vector_destroy(userptr->vec);
+   for (i = 0; i < userptr->npages; i++)
+   set_page_dirty_lock(userptr->pages[i]);
+   unpin_user_pages(userptr->pages, userptr->npages);
+   kvfree(userptr->pages);


Same thing here as in patch 1: you can further simplify by using
unpin_user_pages_dirty_lock().

  
	list_del(&userptr->job_node);
  



thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 01/13] drm/exynos: Stop using frame_vector helpers

2020-10-07 Thread John Hubbard

On 10/7/20 9:44 AM, Daniel Vetter wrote:

All we need are a pages array, pin_user_pages_fast can give us that
directly. Plus this avoids the entire raw pfn side of get_vaddr_frames.

Signed-off-by: Daniel Vetter 
Cc: Jason Gunthorpe 
Cc: Inki Dae 
Cc: Joonyoung Shim 
Cc: Seung-Woo Kim 
Cc: Kyungmin Park 
Cc: Kukjin Kim 
Cc: Krzysztof Kozlowski 
Cc: Andrew Morton 
Cc: John Hubbard 
Cc: Jérôme Glisse 
Cc: Jan Kara 
Cc: Dan Williams 
Cc: linux...@kvack.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-samsung-...@vger.kernel.org
Cc: linux-me...@vger.kernel.org
---
  drivers/gpu/drm/exynos/Kconfig  |  1 -
  drivers/gpu/drm/exynos/exynos_drm_g2d.c | 48 -
  2 files changed, 22 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/exynos/Kconfig b/drivers/gpu/drm/exynos/Kconfig
index 6417f374b923..43257ef3c09d 100644
--- a/drivers/gpu/drm/exynos/Kconfig
+++ b/drivers/gpu/drm/exynos/Kconfig
@@ -88,7 +88,6 @@ comment "Sub-drivers"
  config DRM_EXYNOS_G2D
bool "G2D"
depends on VIDEO_SAMSUNG_S5P_G2D=n || COMPILE_TEST
-   select FRAME_VECTOR
help
  Choose this option if you want to use Exynos G2D for DRM.
  
diff --git a/drivers/gpu/drm/exynos/exynos_drm_g2d.c b/drivers/gpu/drm/exynos/exynos_drm_g2d.c

index 967a5cdc120e..c83f6faac9de 100644
--- a/drivers/gpu/drm/exynos/exynos_drm_g2d.c
+++ b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
@@ -205,7 +205,8 @@ struct g2d_cmdlist_userptr {
dma_addr_t  dma_addr;
unsigned long   userptr;
unsigned long   size;
-   struct frame_vector *vec;
+   struct page **pages;
+	unsigned int		npages;
struct sg_table *sgt;
	atomic_t		refcount;
	bool			in_pool;
@@ -378,7 +379,7 @@ static void g2d_userptr_put_dma_addr(struct g2d_data *g2d,
bool force)
  {
struct g2d_cmdlist_userptr *g2d_userptr = obj;
-   struct page **pages;
+   int i;


The above line can also be deleted, see below.

  
  	if (!obj)

return;
@@ -398,15 +399,11 @@ static void g2d_userptr_put_dma_addr(struct g2d_data *g2d,
dma_unmap_sgtable(to_dma_dev(g2d->drm_dev), g2d_userptr->sgt,
  DMA_BIDIRECTIONAL, 0);
  
-	pages = frame_vector_pages(g2d_userptr->vec);

-   if (!IS_ERR(pages)) {
-   int i;
+   for (i = 0; i < g2d_userptr->npages; i++)
+   set_page_dirty_lock(g2d_userptr->pages[i]);
  
-		for (i = 0; i < frame_vector_count(g2d_userptr->vec); i++)

-   set_page_dirty_lock(pages[i]);
-   }
-   put_vaddr_frames(g2d_userptr->vec);
-   frame_vector_destroy(g2d_userptr->vec);
+   unpin_user_pages(g2d_userptr->pages, g2d_userptr->npages);
+   kvfree(g2d_userptr->pages);


You can avoid writing your own loop, and just simplify the whole thing down to
two lines:

unpin_user_pages_dirty_lock(g2d_userptr->pages, g2d_userptr->npages,
true);
kvfree(g2d_userptr->pages);


  
  	if (!g2d_userptr->out_of_list)

		list_del_init(&g2d_userptr->list);
@@ -474,35 +471,34 @@ static dma_addr_t *g2d_userptr_get_dma_addr(struct 
g2d_data *g2d,
offset = userptr & ~PAGE_MASK;
end = PAGE_ALIGN(userptr + size);
npages = (end - start) >> PAGE_SHIFT;
-   g2d_userptr->vec = frame_vector_create(npages);
-   if (!g2d_userptr->vec) {
+   g2d_userptr->pages = kvmalloc_array(npages, sizeof(*g2d_userptr->pages),
+   GFP_KERNEL);
+   if (!g2d_userptr->pages) {
ret = -ENOMEM;
goto err_free;
}
  
-	ret = get_vaddr_frames(start, npages, FOLL_FORCE | FOLL_WRITE,

-   g2d_userptr->vec);
+   ret = pin_user_pages_fast(start, npages, FOLL_FORCE | FOLL_WRITE,
+ g2d_userptr->pages);
if (ret != npages) {
DRM_DEV_ERROR(g2d->dev,
  "failed to get user pages from userptr.\n");
if (ret < 0)
-   goto err_destroy_framevec;
-   ret = -EFAULT;
-   goto err_put_framevec;
-   }
-   if (frame_vector_to_pages(g2d_userptr->vec) < 0) {
+   goto err_destroy_pages;
+   npages = ret;
ret = -EFAULT;
-   goto err_put_framevec;
+   goto err_unpin_pages;
}
+   g2d_userptr->npages = npages;
  
  	sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);

if (!sgt) {
ret = -ENOMEM;
-   goto err_put_framevec;
+   goto err_unpin_pages;
}
  
  	ret = sg_alloc_table_from_pages(sgt,

- 

Re: [PATCH 2/2] mm/frame-vec: use FOLL_LONGTERM

2020-10-03 Thread John Hubbard

On 10/3/20 2:45 AM, Daniel Vetter wrote:

On Sat, Oct 3, 2020 at 12:39 AM John Hubbard  wrote:


On 10/2/20 10:53 AM, Daniel Vetter wrote:

For $reasons I've stumbled over this code and I'm not sure the change
to the new gup functions in 55a650c35fea ("mm/gup: frame_vector:
convert get_user_pages() --> pin_user_pages()") was entirely correct.

This here is used for long term buffers (not just quick I/O) like
RDMA, and John notes this in his patch. But I thought the rule for
these is that they need to add FOLL_LONGTERM, which John's patch
didn't do.


Yep. The earlier gup --> pup conversion patches were intended to not
have any noticeable behavior changes, and FOLL_LONGTERM, with its
special cases and such, added some risk that I wasn't ready to take
on yet. Also, FOLL_LONGTERM rules are only *recently* getting firmed
up. So there was some doubt at least in my mind, about which sites
should have it.

But now that we're here, I think it's really good that you've brought
this up. It's definitely time to add FOLL_LONGTERM wherever it's missing.


So should I keep this patch, or will it collide with a series you're working on?


It doesn't collide with anything on my end yet, because I've been slow to
pick up on the need for changing callsites to add FOLL_LONGTERM. :)

And it looks like that's actually a problem, because:



Also with the firmed up rules, correct that I can also drop the
vma_is_fsdax check when the FOLL_LONGTERM flag is set?


That's the right direction to go *in general*, but I see that the
pin_user_pages code is still a bit stuck in the past. And this patch
won't actually work, with or without that vma_is_fsdax() check.
Because:

get_vaddr_frames(FOLL_LONGTERM)
   pin_user_pages_locked()
if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
return -EINVAL;


So, again, pin_user_pages*() is at least partly behind the times here.
I can jump in and start fixing it up, but it depends on what you and
Oded and others are planning? Note: there is a particular combination of
dax and locking that we have to still avoid, within gup.c. That's
already covered, but needs to continue to be covered when we enable
FOLL_LONGTERM in the remaining pin_user_pages*() calling paths.



thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 2/2] mm/frame-vec: use FOLL_LONGTERM

2020-10-02 Thread John Hubbard

On 10/2/20 10:53 AM, Daniel Vetter wrote:

For $reasons I've stumbled over this code and I'm not sure the change
to the new gup functions in 55a650c35fea ("mm/gup: frame_vector:
convert get_user_pages() --> pin_user_pages()") was entirely correct.

This here is used for long term buffers (not just quick I/O) like
RDMA, and John notes this in his patch. But I thought the rule for
these is that they need to add FOLL_LONGTERM, which John's patch
didn't do.


Yep. The earlier gup --> pup conversion patches were intended to not
have any noticeable behavior changes, and FOLL_LONGTERM, with its
special cases and such, added some risk that I wasn't ready to take
on yet. Also, FOLL_LONGTERM rules are only *recently* getting firmed
up. So there was some doubt at least in my mind, about which sites
should have it.

But now that we're here, I think it's really good that you've brought
this up. It's definitely time to add FOLL_LONGTERM wherever it's missing.
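
Mechanically, these conversions come down to something like this
(illustrative):

-	pin_user_pages_fast(start, npages, FOLL_WRITE, pages);
+	pin_user_pages_fast(start, npages, FOLL_WRITE | FOLL_LONGTERM, pages);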

thanks,
--
John Hubbard
NVIDIA



There is already a dax specific check (added in b7f0554a56f2 ("mm:
fail get_vaddr_frames() for filesystem-dax mappings")), so this seems
like the prudent thing to do.

Signed-off-by: Daniel Vetter 
Cc: Andrew Morton 
Cc: John Hubbard 
Cc: Jérôme Glisse 
Cc: Jan Kara 
Cc: Dan Williams 
Cc: linux...@kvack.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-samsung-...@vger.kernel.org
Cc: linux-me...@vger.kernel.org
---
Hi all,

I stumbled over this and figured typing this patch can't hurt. Really
just to maybe learn a few things about how gup/pup is supposed to be
used (we have a bit of that in drivers/gpu), this here isn't really
related to anything I'm doing.

I'm also wondering whether the explicit dax check should be removed,
since FOLL_LONGTERM should take care of that already.
-Daniel
---
  mm/frame_vector.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/frame_vector.c b/mm/frame_vector.c
index 5d34c9047e9c..3507e09cb3ff 100644
--- a/mm/frame_vector.c
+++ b/mm/frame_vector.c
@@ -35,7 +35,7 @@ int get_vaddr_frames(unsigned long start, unsigned int 
nr_frames,
  {
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
-   unsigned int gup_flags = FOLL_WRITE | FOLL_FORCE;
+   unsigned int gup_flags = FOLL_WRITE | FOLL_FORCE | FOLL_LONGTERM;
int ret = 0;
int err;
int locked;





Re: [PATCH v3] tee: convert convert get_user_pages() --> pin_user_pages()

2020-08-25 Thread John Hubbard

On 8/25/20 1:32 AM, Jens Wiklander wrote:

On Mon, Aug 24, 2020 at 02:11:25PM -0700, John Hubbard wrote:

...

OK, one more try, this time actually handling the _USER_MAPPED vs.
_KERNEL_MAPPED pages!

thanks,
John Hubbard
NVIDIA


Looks good and it works too! :-) I've tested it on my Hikey board with
the OP-TEE test suite.
I'm picking this up.



Great! I see that I have, once again, somehow doubled up on the subject line:
"tee: convert convert ...". This particular typo just seems to stick to me. :)

If you get a chance to fix that up by changing it to just a single "convert"
I'd appreciate it.

thanks,
--
John Hubbard
NVIDIA


[PATCH v3] tee: convert convert get_user_pages() --> pin_user_pages()

2020-08-24 Thread John Hubbard
This code was using get_user_pages*(), in a "Case 2" scenario
(DMA/RDMA), using the categorization from [1]. That means that it's
time to convert the get_user_pages*() + put_page() calls to
pin_user_pages*() + unpin_user_pages() calls.

Factor out a new, small release_registered_pages() function, in
order to consolidate the logic for discerning between
TEE_SHM_USER_MAPPED and TEE_SHM_KERNEL_MAPPED pages. This also
absorbs the kfree() call that is also required there.

There is some helpful background in [2]: basically, this is a small
part of fixing a long-standing disconnect between pinning pages, and
file systems' use of those pages.

[1] Documentation/core-api/pin_user_pages.rst

[2] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/

Cc: Jens Wiklander 
Cc: Sumit Semwal 
Cc: tee-...@lists.linaro.org
Cc: linux-me...@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-...@lists.linaro.org
Signed-off-by: John Hubbard 
---

OK, one more try, this time actually handling the _USER_MAPPED vs.
_KERNEL_MAPPED pages!

thanks,
John Hubbard
NVIDIA

 drivers/tee/tee_shm.c | 32 +++-
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/drivers/tee/tee_shm.c b/drivers/tee/tee_shm.c
index 827ac3d0fea9..00472f5ce22e 100644
--- a/drivers/tee/tee_shm.c
+++ b/drivers/tee/tee_shm.c
@@ -12,6 +12,22 @@
 #include 
 #include "tee_private.h"
 
+static void release_registered_pages(struct tee_shm *shm)
+{
+   if (shm->pages) {
+   if (shm->flags & TEE_SHM_USER_MAPPED) {
+   unpin_user_pages(shm->pages, shm->num_pages);
+   } else {
+   size_t n;
+
+   for (n = 0; n < shm->num_pages; n++)
+   put_page(shm->pages[n]);
+   }
+
+   kfree(shm->pages);
+   }
+}
+
 static void tee_shm_release(struct tee_shm *shm)
 {
struct tee_device *teedev = shm->ctx->teedev;
@@ -32,17 +48,13 @@ static void tee_shm_release(struct tee_shm *shm)
 
poolm->ops->free(poolm, shm);
} else if (shm->flags & TEE_SHM_REGISTER) {
-   size_t n;
int rc = teedev->desc->ops->shm_unregister(shm->ctx, shm);
 
if (rc)
dev_err(teedev->dev.parent,
"unregister shm %p failed: %d", shm, rc);
 
-   for (n = 0; n < shm->num_pages; n++)
-   put_page(shm->pages[n]);
-
-   kfree(shm->pages);
+   release_registered_pages(shm);
}
 
teedev_ctx_put(shm->ctx);
@@ -228,7 +240,7 @@ struct tee_shm *tee_shm_register(struct tee_context *ctx, 
unsigned long addr,
}
 
if (flags & TEE_SHM_USER_MAPPED) {
-   rc = get_user_pages_fast(start, num_pages, FOLL_WRITE,
+   rc = pin_user_pages_fast(start, num_pages, FOLL_WRITE,
 shm->pages);
} else {
struct kvec *kiov;
@@ -292,18 +304,12 @@ struct tee_shm *tee_shm_register(struct tee_context *ctx, 
unsigned long addr,
return shm;
 err:
if (shm) {
-   size_t n;
-
if (shm->id >= 0) {
			mutex_lock(&teedev->mutex);
			idr_remove(&teedev->idr, shm->id);
			mutex_unlock(&teedev->mutex);
}
-   if (shm->pages) {
-   for (n = 0; n < shm->num_pages; n++)
-   put_page(shm->pages[n]);
-   kfree(shm->pages);
-   }
+   release_registered_pages(shm);
}
kfree(shm);
teedev_ctx_put(ctx);
-- 
2.28.0



Re: [PATCH v2] tee: convert convert get_user_pages() --> pin_user_pages()

2020-08-24 Thread John Hubbard

On 8/24/20 11:36 AM, John Hubbard wrote:

This code was using get_user_pages*(), in a "Case 2" scenario
(DMA/RDMA), using the categorization from [1]. That means that it's
time to convert the get_user_pages*() + put_page() calls to
pin_user_pages*() + unpin_user_pages() calls.

There is some helpful background in [2]: basically, this is a small
part of fixing a long-standing disconnect between pinning pages, and
file systems' use of those pages.

[1] Documentation/core-api/pin_user_pages.rst

[2] "Explicit pinning of user-space pages":
 https://lwn.net/Articles/807108/

Cc: Jens Wiklander 
Cc: Sumit Semwal 
Cc: tee-...@lists.linaro.org
Cc: linux-me...@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-...@lists.linaro.org
Signed-off-by: John Hubbard 
---

OK, this should be identical to v1 [1], but now rebased against
Linux 5.9-rc2.



...ohhh, wait, I should have read the earlier message from Jens more
carefully:

"The conflict isn't trivial, I guess we need to handle the different
types of pages differently when releasing them."

So it's not good to have a logically identical patch. argghhh. Let me see
how hard it is to track these memory types separately and handle the release
accordingly, just a sec.

Sorry about the false move here.

thanks,
--
John Hubbard
NVIDIA


As before, I've compile-tested it again with a cross compiler, but that's
the only testing I'm set up for with CONFIG_TEE.

[1] https://lore.kernel.org/r/20200519051850.2845561-1-jhubb...@nvidia.com

thanks,
John Hubbard
NVIDIA

  drivers/tee/tee_shm.c | 12 +++-
  1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/drivers/tee/tee_shm.c b/drivers/tee/tee_shm.c
index 827ac3d0fea9..3c29e6c3ebe8 100644
--- a/drivers/tee/tee_shm.c
+++ b/drivers/tee/tee_shm.c
@@ -32,16 +32,13 @@ static void tee_shm_release(struct tee_shm *shm)
  
  		poolm->ops->free(poolm, shm);

} else if (shm->flags & TEE_SHM_REGISTER) {
-   size_t n;
int rc = teedev->desc->ops->shm_unregister(shm->ctx, shm);
  
  		if (rc)

dev_err(teedev->dev.parent,
"unregister shm %p failed: %d", shm, rc);
  
-		for (n = 0; n < shm->num_pages; n++)

-   put_page(shm->pages[n]);
-
+   unpin_user_pages(shm->pages, shm->num_pages);
kfree(shm->pages);
}
  
@@ -228,7 +225,7 @@ struct tee_shm *tee_shm_register(struct tee_context *ctx, unsigned long addr,

}
  
  	if (flags & TEE_SHM_USER_MAPPED) {

-   rc = get_user_pages_fast(start, num_pages, FOLL_WRITE,
+   rc = pin_user_pages_fast(start, num_pages, FOLL_WRITE,
 shm->pages);
} else {
struct kvec *kiov;
@@ -292,16 +289,13 @@ struct tee_shm *tee_shm_register(struct tee_context *ctx, 
unsigned long addr,
return shm;
  err:
if (shm) {
-   size_t n;
-
if (shm->id >= 0) {
			mutex_lock(&teedev->mutex);
			idr_remove(&teedev->idr, shm->id);
			mutex_unlock(&teedev->mutex);
}
if (shm->pages) {
-   for (n = 0; n < shm->num_pages; n++)
-   put_page(shm->pages[n]);
+   unpin_user_pages(shm->pages, shm->num_pages);
kfree(shm->pages);
}
}





[PATCH v2] tee: convert convert get_user_pages() --> pin_user_pages()

2020-08-24 Thread John Hubbard
This code was using get_user_pages*(), in a "Case 2" scenario
(DMA/RDMA), using the categorization from [1]. That means that it's
time to convert the get_user_pages*() + put_page() calls to
pin_user_pages*() + unpin_user_pages() calls.

There is some helpful background in [2]: basically, this is a small
part of fixing a long-standing disconnect between pinning pages, and
file systems' use of those pages.

[1] Documentation/core-api/pin_user_pages.rst

[2] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/

Cc: Jens Wiklander 
Cc: Sumit Semwal 
Cc: tee-...@lists.linaro.org
Cc: linux-me...@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-...@lists.linaro.org
Signed-off-by: John Hubbard 
---

OK, this should be identical to v1 [1], but now rebased against
Linux 5.9-rc2.

As before, I've compile-tested it again with a cross compiler, but that's
the only testing I'm set up for with CONFIG_TEE.

[1] https://lore.kernel.org/r/20200519051850.2845561-1-jhubb...@nvidia.com

thanks,
John Hubbard
NVIDIA

 drivers/tee/tee_shm.c | 12 +++-
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/drivers/tee/tee_shm.c b/drivers/tee/tee_shm.c
index 827ac3d0fea9..3c29e6c3ebe8 100644
--- a/drivers/tee/tee_shm.c
+++ b/drivers/tee/tee_shm.c
@@ -32,16 +32,13 @@ static void tee_shm_release(struct tee_shm *shm)
 
poolm->ops->free(poolm, shm);
} else if (shm->flags & TEE_SHM_REGISTER) {
-   size_t n;
int rc = teedev->desc->ops->shm_unregister(shm->ctx, shm);
 
if (rc)
dev_err(teedev->dev.parent,
"unregister shm %p failed: %d", shm, rc);
 
-   for (n = 0; n < shm->num_pages; n++)
-   put_page(shm->pages[n]);
-
+   unpin_user_pages(shm->pages, shm->num_pages);
kfree(shm->pages);
}
 
@@ -228,7 +225,7 @@ struct tee_shm *tee_shm_register(struct tee_context *ctx, 
unsigned long addr,
}
 
if (flags & TEE_SHM_USER_MAPPED) {
-   rc = get_user_pages_fast(start, num_pages, FOLL_WRITE,
+   rc = pin_user_pages_fast(start, num_pages, FOLL_WRITE,
 shm->pages);
} else {
struct kvec *kiov;
@@ -292,16 +289,13 @@ struct tee_shm *tee_shm_register(struct tee_context *ctx, 
unsigned long addr,
return shm;
 err:
if (shm) {
-   size_t n;
-
if (shm->id >= 0) {
			mutex_lock(&teedev->mutex);
			idr_remove(&teedev->idr, shm->id);
			mutex_unlock(&teedev->mutex);
}
if (shm->pages) {
-   for (n = 0; n < shm->num_pages; n++)
-   put_page(shm->pages[n]);
+   unpin_user_pages(shm->pages, shm->num_pages);
kfree(shm->pages);
}
}
-- 
2.28.0



Re: [PATCH 0/2] video: fbdev: fix error handling, convert to pin_user_pages*()

2020-06-01 Thread John Hubbard

On 2020-06-01 10:25, Andy Shevchenko wrote:

On Mon, Jun 1, 2020 at 8:10 PM John Hubbard  wrote:


On 2020-06-01 03:35, Andy Shevchenko wrote:

On Mon, Jun 1, 2020 at 1:00 AM John Hubbard  wrote:

On 2020-05-31 14:11, Andy Shevchenko wrote:

  ...
JFYI, we have history.git starting from v0.01.


OK, thanks for that note. According to that history.git [1],
then: drivers/video/pvr2fb.c had get_user_pages_fast() support added to
pvr2fb_write() back in 2004, but only for CONFIG_SH_DMA, as part of

   commit 434502754f2 ("[PATCH] SH Merge")

...and that commit created the minor bug that patch 0001 here
addresses. (+Cc Paul just for the sake of completeness.)


[1] git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git


I mentioned this one, but I guess content should be the same.

https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/



Actually, that history.git *starts* at Linux 2.6.12-rc2,


It's not true.


OK I see, neither a straight "git log" nor git branches will suffice, you
have to use tags in order to get to the older versions.




while
tglx/history.git *ends* at Linux 2.6.12-rc2 (which is in April, 2005).
And the commit I was looking for is in 2004. So that's why I needed a
different stretch of history.


Actually history/history.git contains all of them starting from v0.01.
But it ends, indeed, on 2.6.33.



thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 0/2] video: fbdev: fix error handling, convert to pin_user_pages*()

2020-06-01 Thread John Hubbard

On 2020-06-01 03:35, Andy Shevchenko wrote:

On Mon, Jun 1, 2020 at 1:00 AM John Hubbard  wrote:

On 2020-05-31 14:11, Andy Shevchenko wrote:

 ...
JFYI, we have history.git starting from v0.01.


OK, thanks for that note. According to that history.git [1],
then: drivers/video/pvr2fb.c had get_user_pages_fast() support added to
pvr2fb_write() back in 2004, but only for CONFIG_SH_DMA, as part of

  commit 434502754f2 ("[PATCH] SH Merge")

...and that commit created the minor bug that patch 0001 here
addresses. (+Cc Paul just for the sake of completeness.)


[1] git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git


I mentioned this one, but I guess content should be the same.

https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/



Actually, that history.git *starts* at Linux 2.6.12-rc2, while
tglx/history.git *ends* at Linux 2.6.12-rc2 (which is in April, 2005).
And the commit I was looking for is in 2004. So that's why I needed a
different stretch of history.



thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 0/2] video: fbdev: fix error handling, convert to pin_user_pages*()

2020-05-31 Thread John Hubbard

On 2020-05-31 14:11, Andy Shevchenko wrote:

...
JFYI, we have history.git starting from v0.01.


OK, thanks for that note. According to that history.git [1],
then: drivers/video/pvr2fb.c had get_user_pages_fast() support added to
pvr2fb_write() back in 2004, but only for CONFIG_SH_DMA, as part of

commit 434502754f2 ("[PATCH] SH Merge")

...and that commit created the minor bug that patch 0001 here
addresses. (+Cc Paul just for the sake of completeness.)


[1] git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 0/2] video: fbdev: fix error handling, convert to pin_user_pages*()

2020-05-31 Thread John Hubbard

On 2020-05-31 13:58, Sam Ravnborg wrote:
...

Thanks, patches are now applied to drm-misc-next.
They will hit -next soon, but you will have to wait
until next (not the upcoming) merge window before they hit
mainline linux.

Sam



Great! That will work out just fine.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] drm/radeon: Convert get_user_pages() --> pin_user_pages()

2020-05-29 Thread John Hubbard

On 2020-05-28 23:49, Souptick Joarder wrote:
...

This is what case 3 was *intended* to cover, but it looks like case 3 needs to
be written a little better. I'll attempt that, and Cc you on the actual patch
to -mm. (I think we also need a case 5 for an unrelated scenario, too, so
it's time.)


There was no *case 5* in the other patch posted in -mm. Do we need to add it?



Working on figuring that out [1], but it's not directly relevant to this thread.
Maybe I shouldn't have brought it up here. :)


[1] https://lore.kernel.org/r/20200529070343.gl14...@quack2.suse.cz

thanks,
John Hubbard
NVIDIA





Re: [PATCH] drm/radeon: Convert get_user_pages() --> pin_user_pages()

2020-05-27 Thread John Hubbard

On 2020-05-27 01:51, Daniel Vetter wrote:

On Wed, May 27, 2020 at 10:48:52AM +0200, Daniel Vetter wrote:

On Tue, May 26, 2020 at 03:57:45PM -0700, John Hubbard wrote:

On 2020-05-26 14:00, Souptick Joarder wrote:

This code was using get_user_pages(), in a "Case 2" scenario
(DMA/RDMA), using the categorization from [1]. That means that it's
time to convert the get_user_pages() + release_pages() calls to
pin_user_pages() + unpin_user_pages() calls.

There is some helpful background in [2]: basically, this is a small
part of fixing a long-standing disconnect between pinning pages, and
file systems' use of those pages.

[1] Documentation/core-api/pin_user_pages.rst

[2] "Explicit pinning of user-space pages":
  https://lwn.net/Articles/807108/


I don't think this is a case 2 here, nor is it any of the others. Feels
like it's not covered at all by the doc.

radeon has a mmu notifier (might be a bit broken, but hey whatever there's
other drivers which have the same concept, but less broken). So when you
do an munmap, radeon will release the page refcount.




Aha, thanks Daniel. I withdraw my misinformed ACK, then.


I forgot to add: It's also not case 3, since there's no hw page fault
support. It's all faked in software, and explicitly synchronizes against
pending io (or preempts it, that depends a bit upon the jobs running).



This is what case 3 was *intended* to cover, but it looks like case 3 needs to
be written a little better. I'll attempt that, and Cc you on the actual patch
to -mm. (I think we also need a case 5 for an unrelated scenario, too, so
it's time.)
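
(A bare-bones sketch of the mmu-notifier pattern Daniel describes;
hypothetical driver code, all names invented, sequence-number handling
omitted for brevity:)

	static bool drv_userptr_invalidate(struct mmu_interval_notifier *mni,
					   const struct mmu_notifier_range *range,
					   unsigned long cur_seq)
	{
		struct drv_userptr *up =
			container_of(mni, struct drv_userptr, notifier);

		/* On munmap/invalidate: quiesce the GPU, then drop the pins. */
		drv_wait_for_idle(up);
		unpin_user_pages(up->pages, up->npages);
		up->npages = 0;

		return true;
	}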


thanks,
--
John Hubbard
NVIDIA



Which case it that?

Note that currently only amdgpu doesn't work like that for gpu dma
directly to userspace ranges, it uses hmm and afaiui doesn't hold a full
page pin refcount.

Cheers, Daniel




Signed-off-by: Souptick Joarder 
Cc: John Hubbard 

Hi,

I've compile-tested this, but am unable to run-time test it, so any testing
help is much appreciated.
---
   drivers/gpu/drm/radeon/radeon_ttm.c | 6 +++---
   1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c 
b/drivers/gpu/drm/radeon/radeon_ttm.c
index 5d50c9e..e927de2 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -506,7 +506,7 @@ static int radeon_ttm_tt_pin_userptr(struct ttm_tt *ttm)
uint64_t userptr = gtt->userptr + pinned * PAGE_SIZE;
struct page **pages = ttm->pages + pinned;
-   r = get_user_pages(userptr, num_pages, write ? FOLL_WRITE : 0,
+   r = pin_user_pages(userptr, num_pages, write ? FOLL_WRITE : 0,
   pages, NULL);
if (r < 0)
goto release_pages;
@@ -535,7 +535,7 @@ static int radeon_ttm_tt_pin_userptr(struct ttm_tt *ttm)
kfree(ttm->sg);
   release_pages:
-   release_pages(ttm->pages, pinned);
+   unpin_user_pages(ttm->pages, pinned);
return r;
   }
@@ -562,7 +562,7 @@ static void radeon_ttm_tt_unpin_userptr(struct ttm_tt *ttm)
set_page_dirty(page);



Maybe we also need a preceding patch, to fix the above? It should be
set_page_dirty_lock(), rather than set_page_dirty(), unless I'm overlooking
something (which is very possible!).

Either way, from a tunnel vision perspective of changing gup to pup, this
looks good to me, so

 Acked-by: John Hubbard 


thanks,
--
John Hubbard
NVIDIA


mark_page_accessed(page);
-   put_page(page);
+   unpin_user_page(page);
}
sg_free_table(ttm->sg);





--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch






Re: [PATCH] drm/radeon: Convert get_user_pages() --> pin_user_pages()

2020-05-26 Thread John Hubbard

On 2020-05-26 14:00, Souptick Joarder wrote:

This code was using get_user_pages(), in a "Case 2" scenario
(DMA/RDMA), using the categorization from [1]. That means that it's
time to convert the get_user_pages() + release_pages() calls to
pin_user_pages() + unpin_user_pages() calls.

There is some helpful background in [2]: basically, this is a small
part of fixing a long-standing disconnect between pinning pages, and
file systems' use of those pages.

[1] Documentation/core-api/pin_user_pages.rst

[2] "Explicit pinning of user-space pages":
 https://lwn.net/Articles/807108/

Signed-off-by: Souptick Joarder 
Cc: John Hubbard 

Hi,

I've compile-tested this, but am unable to run-time test it, so any testing
help is much appreciated.
---
  drivers/gpu/drm/radeon/radeon_ttm.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c 
b/drivers/gpu/drm/radeon/radeon_ttm.c
index 5d50c9e..e927de2 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -506,7 +506,7 @@ static int radeon_ttm_tt_pin_userptr(struct ttm_tt *ttm)
uint64_t userptr = gtt->userptr + pinned * PAGE_SIZE;
struct page **pages = ttm->pages + pinned;
  
-		r = get_user_pages(userptr, num_pages, write ? FOLL_WRITE : 0,

+   r = pin_user_pages(userptr, num_pages, write ? FOLL_WRITE : 0,
   pages, NULL);
if (r < 0)
goto release_pages;
@@ -535,7 +535,7 @@ static int radeon_ttm_tt_pin_userptr(struct ttm_tt *ttm)
kfree(ttm->sg);
  
  release_pages:

-   release_pages(ttm->pages, pinned);
+   unpin_user_pages(ttm->pages, pinned);
return r;
  }
  
@@ -562,7 +562,7 @@ static void radeon_ttm_tt_unpin_userptr(struct ttm_tt *ttm)

set_page_dirty(page);



Maybe we also need a preceding patch, to fix the above? It should be
set_page_dirty_lock(), rather than set_page_dirty(), unless I'm overlooking
something (which is very possible!).
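
Something like this, that is (a sketch of the suggested preceding fix,
against the unpin loop quoted above):

-		set_page_dirty(page);
+		set_page_dirty_lock(page);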

Either way, from a tunnel vision perspective of changing gup to pup, this
looks good to me, so

    Acked-by: John Hubbard 


thanks,
--
John Hubbard
NVIDIA

  
  		mark_page_accessed(page);

-   put_page(page);
+   unpin_user_page(page);
}
  
  	sg_free_table(ttm->sg);






[PATCH v2] drm/etnaviv: convert get_user_pages() --> pin_user_pages()

2020-05-25 Thread John Hubbard
This code was using get_user_pages*(), in a "Case 2" scenario
(DMA/RDMA), using the categorization from [1]. That means that it's
time to convert the get_user_pages*() + put_page() calls to
pin_user_pages*() + unpin_user_pages() calls.

There is some helpful background in [2]: basically, this is a small
part of fixing a long-standing disconnect between pinning pages, and
file systems' use of those pages.

[1] Documentation/core-api/pin_user_pages.rst

[2] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/

Signed-off-by: John Hubbard 
---

Hi,

Changes since v1:

* Rebased onto Linux 5.7-rc7

* Added: Lucas Stach

thanks
John Hubbard
NVIDIA


 drivers/gpu/drm/etnaviv/etnaviv_gem.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem.c 
b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
index dc9ef302f517..0f4578dc169d 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_gem.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
@@ -675,10 +675,10 @@ static int etnaviv_gem_userptr_get_pages(struct 
etnaviv_gem_object *etnaviv_obj)
uint64_t ptr = userptr->ptr + pinned * PAGE_SIZE;
struct page **pages = pvec + pinned;
 
-   ret = get_user_pages_fast(ptr, num_pages,
+   ret = pin_user_pages_fast(ptr, num_pages,
  !userptr->ro ? FOLL_WRITE : 0, pages);
if (ret < 0) {
-   release_pages(pvec, pinned);
+   unpin_user_pages(pvec, pinned);
kvfree(pvec);
return ret;
}
@@ -702,7 +702,7 @@ static void etnaviv_gem_userptr_release(struct 
etnaviv_gem_object *etnaviv_obj)
if (etnaviv_obj->pages) {
int npages = etnaviv_obj->base.size >> PAGE_SHIFT;
 
-   release_pages(etnaviv_obj->pages, npages);
+   unpin_user_pages(etnaviv_obj->pages, npages);
kvfree(etnaviv_obj->pages);
}
 }
-- 
2.26.2



[PATCH v2] tee: convert get_user_pages() --> pin_user_pages()

2020-05-25 Thread John Hubbard
This code was using get_user_pages*(), in a "Case 2" scenario
(DMA/RDMA), using the categorization from [1]. That means that it's
time to convert the get_user_pages*() + put_page() calls to
pin_user_pages*() + unpin_user_pages() calls.

There is some helpful background in [2]: basically, this is a small
part of fixing a long-standing disconnect between pinning pages, and
file systems' use of those pages.

[1] Documentation/core-api/pin_user_pages.rst

[2] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/

Cc: Jens Wiklander 
Cc: Sumit Semwal 
Cc: tee-...@lists.linaro.org
Cc: linux-me...@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-...@lists.linaro.org
Signed-off-by: John Hubbard 
---

Hi,

This fixes the typo ("convert convert") in the subject line, but
otherwise no changes.

thanks,
John Hubbard
NVIDIA


 drivers/tee/tee_shm.c | 12 +++-
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/drivers/tee/tee_shm.c b/drivers/tee/tee_shm.c
index bd679b72bd05..7dffc42d8d5a 100644
--- a/drivers/tee/tee_shm.c
+++ b/drivers/tee/tee_shm.c
@@ -31,16 +31,13 @@ static void tee_shm_release(struct tee_shm *shm)
 
poolm->ops->free(poolm, shm);
} else if (shm->flags & TEE_SHM_REGISTER) {
-   size_t n;
int rc = teedev->desc->ops->shm_unregister(shm->ctx, shm);
 
if (rc)
dev_err(teedev->dev.parent,
"unregister shm %p failed: %d", shm, rc);
 
-   for (n = 0; n < shm->num_pages; n++)
-   put_page(shm->pages[n]);
-
+   unpin_user_pages(shm->pages, shm->num_pages);
kfree(shm->pages);
}
 
@@ -226,7 +223,7 @@ struct tee_shm *tee_shm_register(struct tee_context *ctx, 
unsigned long addr,
goto err;
}
 
-   rc = get_user_pages_fast(start, num_pages, FOLL_WRITE, shm->pages);
+   rc = pin_user_pages_fast(start, num_pages, FOLL_WRITE, shm->pages);
if (rc > 0)
shm->num_pages = rc;
if (rc != num_pages) {
@@ -271,16 +268,13 @@ struct tee_shm *tee_shm_register(struct tee_context *ctx, 
unsigned long addr,
return shm;
 err:
if (shm) {
-   size_t n;
-
if (shm->id >= 0) {
			mutex_lock(&teedev->mutex);
			idr_remove(&teedev->idr, shm->id);
			mutex_unlock(&teedev->mutex);
}
if (shm->pages) {
-   for (n = 0; n < shm->num_pages; n++)
-   put_page(shm->pages[n]);
+   unpin_user_pages(shm->pages, shm->num_pages);
kfree(shm->pages);
}
}
-- 
2.26.2



Re: [PATCH v2 0/4] mm/gup, drm/i915: refactor gup_fast, convert to pin_user_pages()

2020-05-23 Thread John Hubbard

On 2020-05-23 02:41, Chris Wilson wrote:

Quoting John Hubbard (2020-05-22 06:19:27)

The purpose of posting this series is to launch a test in the
intel-gfx-ci tree. (The patches have already been merged into Andrew's
linux-mm tree.)

This applies to today's linux.git (note the base-commit tag at the
bottom).

Changes since V1:

* Fixed a bug in the refactoring patch: added FOLL_FAST_ONLY to the
   list of gup_flags *not* to WARN() on. This led to a failure in the
   first intel-gfx-ci test run [1].

[1] 
https://lore.kernel.org/r/159008745422.32320.5724805750977048...@build.alporthouse.com


Ran this through our CI, warn and subsequent lockup were gone. That
lockup is worrying me now, but that doesn't seem to be an issue from
this series.


Yea! Thanks again for these test runs. I really don't like posting
patches that I can't run-time test, but this CI system mitigates
that pretty well.



I do think it's worth following up on. And it seems like it would be
very easy to repro: just hack in a forced failure at the call site of
pin_user_pages_fast_only(), and follow the breadcrumbs.




The i915 changes were simple enough, I would have computed the pin flags
just once (since the readonly bit is static, that would be interesting
if that was allowed to change mid gup :)
Reviewed-by: Chris Wilson 
-Chris



Thanks for the review! And if lifting that check up higher in the call
stack is desired, I'm all in favor of that being done...in a separate
patch. :)

I'm trying to keep a very light touch when converting these call sites.
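
(For the record, the hoisting Chris suggests would look something like
this, computed once near the top of i915_gem_userptr_get_pages():)

	unsigned int gup_flags =
		i915_gem_object_is_readonly(obj) ? 0 : FOLL_WRITE;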

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 0/4] mm/gup, drm/i915: refactor gup_fast, convert to pin_user_pages()

2020-05-22 Thread John Hubbard

On 2020-05-22 04:40, Souptick Joarder wrote:
...

3) Make it easy for an upcoming patch from Souptick, which aims to
convert __get_user_pages_fast() to use a gup_flags argument, instead
of a bool writeable arg.  Also, if this series looks good, we can
ask Souptick to change the name as well, to whatever the consensus
is. My initial recommendation is: get_user_pages_fast_only(), to
match the new pin_user_pages_fast_only().


Shall I hold my changes till 5.8-rc1, when this series will appear upstream?


I don't really see any problem with your posting something that is based on
the latest linux-next (which has my changes now). Should be fine. And in
fact it would be nice to get that done in this round, so that the pin* and
get* APIs look the same.


thanks,
--
John Hubbard
NVIDIA


[PATCH v2 4/4] drm/i915: convert get_user_pages() --> pin_user_pages()

2020-05-21 Thread John Hubbard
This code was using get_user_pages*(), in a "Case 2" scenario
(DMA/RDMA), using the categorization from [1]. That means that it's
time to convert the get_user_pages*() + put_page() calls to
pin_user_pages*() + unpin_user_pages() calls.

There is some helpful background in [2]: basically, this is a small
part of fixing a long-standing disconnect between pinning pages, and
file systems' use of those pages.

[1] Documentation/core-api/pin_user_pages.rst

[2] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/

Signed-off-by: John Hubbard 
---
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 22 -
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c 
b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
index 7ffd7afeb7a5..b55ac7563189 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -471,7 +471,7 @@ __i915_gem_userptr_get_pages_worker(struct work_struct 
*_work)
		down_read(&mm->mmap_sem);
locked = 1;
}
-   ret = get_user_pages_remote
+   ret = pin_user_pages_remote
(work->task, mm,
 obj->userptr.ptr + pinned * PAGE_SIZE,
 npages - pinned,
@@ -507,7 +507,7 @@ __i915_gem_userptr_get_pages_worker(struct work_struct 
*_work)
}
	mutex_unlock(&obj->mm.lock);
 
-   release_pages(pvec, pinned);
+   unpin_user_pages(pvec, pinned);
kvfree(pvec);
 
i915_gem_object_put(obj);
@@ -564,6 +564,7 @@ static int i915_gem_userptr_get_pages(struct 
drm_i915_gem_object *obj)
struct sg_table *pages;
bool active;
int pinned;
+   unsigned int gup_flags = 0;
 
/* If userspace should engineer that these pages are replaced in
 * the vma between us binding this page into the GTT and completion
@@ -598,11 +599,14 @@ static int i915_gem_userptr_get_pages(struct 
drm_i915_gem_object *obj)
  GFP_KERNEL |
  __GFP_NORETRY |
  __GFP_NOWARN);
-   if (pvec) /* defer to worker if malloc fails */
-   pinned = __get_user_pages_fast(obj->userptr.ptr,
-  num_pages,
-  !i915_gem_object_is_readonly(obj),
-  pvec);
+   /* defer to worker if malloc fails */
+   if (pvec) {
+   if (!i915_gem_object_is_readonly(obj))
+   gup_flags |= FOLL_WRITE;
+   pinned = pin_user_pages_fast_only(obj->userptr.ptr,
+ num_pages, gup_flags,
+ pvec);
+   }
}
 
active = false;
@@ -620,7 +624,7 @@ static int i915_gem_userptr_get_pages(struct 
drm_i915_gem_object *obj)
__i915_gem_userptr_set_active(obj, true);
 
if (IS_ERR(pages))
-   release_pages(pvec, pinned);
+   unpin_user_pages(pvec, pinned);
kvfree(pvec);
 
return PTR_ERR_OR_ZERO(pages);
@@ -675,7 +679,7 @@ i915_gem_userptr_put_pages(struct drm_i915_gem_object *obj,
}
 
mark_page_accessed(page);
-   put_page(page);
+   unpin_user_page(page);
}
obj->mm.dirty = false;
 
-- 
2.26.2



[PATCH v2 2/4] mm/gup: refactor and de-duplicate gup_fast() code

2020-05-21 Thread John Hubbard
There were two nearly identical sets of code for gup_fast()
style of walking the page tables with interrupts disabled.
This has led to the usual maintenance problems that arise from
having duplicated code.

There is already a core internal routine in gup.c for gup_fast(),
so just enhance it very slightly: allow skipping the fall-back
to "slow" (regular) get_user_pages(), via the new FOLL_FAST_ONLY
flag. Then, just call internal_get_user_pages_fast() from
__get_user_pages_fast(), and adjust the API to match pre-existing
API behavior.

There is a change in behavior from this refactoring: the nested
form of interrupt disabling is used in all gup_fast() variants
now. That's because there is only one place that interrupt disabling
for page walking is done, and so the safer form is required. This
should, if anything, eliminate possible (rare) bugs, because the
non-nested form of enabling interrupts was fragile at best.
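
(Nested vs. non-nested, side by side; illustration only:)

	/*
	 * Non-nested: unconditionally re-enables IRQs on the way out, which
	 * is wrong if the caller already had them disabled:
	 */
	local_irq_disable();
	gup_pgd_range(addr, end, gup_flags, pages, &nr_pinned);
	local_irq_enable();

	/* Nested: restores whatever IRQ state the caller had: */
	local_irq_save(flags);
	gup_pgd_range(addr, end, gup_flags, pages, &nr_pinned);
	local_irq_restore(flags);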

Signed-off-by: John Hubbard 
---
 include/linux/mm.h |  1 +
 mm/gup.c   | 63 ++
 2 files changed, 31 insertions(+), 33 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a5594ac9ebe3..84b601cab699 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2782,6 +2782,7 @@ struct page *follow_page(struct vm_area_struct *vma, 
unsigned long address,
 #define FOLL_LONGTERM  0x1 /* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD 0x2 /* split huge pmd before returning */
 #define FOLL_PIN   0x4 /* pages must be released via unpin_user_page */
+#define FOLL_FAST_ONLY 0x8 /* gup_fast: prevent fall-back to slow gup */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index 4502846d57f9..4564b0dc7d0b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2694,10 +2694,12 @@ static int internal_get_user_pages_fast(unsigned long 
start, int nr_pages,
struct page **pages)
 {
unsigned long addr, len, end;
+   unsigned long flags;
int nr_pinned = 0, ret = 0;
 
if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
-  FOLL_FORCE | FOLL_PIN | FOLL_GET)))
+  FOLL_FORCE | FOLL_PIN | FOLL_GET |
+  FOLL_FAST_ONLY)))
return -EINVAL;
 
start = untagged_addr(start) & PAGE_MASK;
@@ -2710,15 +2712,26 @@ static int internal_get_user_pages_fast(unsigned long 
start, int nr_pages,
if (unlikely(!access_ok((void __user *)start, len)))
return -EFAULT;
 
+   /*
+* Disable interrupts. The nested form is used, in order to allow full,
+* general purpose use of this routine.
+*
+* With interrupts disabled, we block page table pages from being
+* freed from under us. See struct mmu_table_batch comments in
+* include/asm-generic/tlb.h for more details.
+*
+* We do not adopt an rcu_read_lock(.) here as we also want to
+* block IPIs that come from THPs splitting.
+*/
if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
gup_fast_permitted(start, end)) {
-   local_irq_disable();
+   local_irq_save(flags);
		gup_pgd_range(addr, end, gup_flags, pages, &nr_pinned);
-   local_irq_enable();
+   local_irq_restore(flags);
ret = nr_pinned;
}
 
-   if (nr_pinned < nr_pages) {
+   if (nr_pinned < nr_pages && !(gup_flags & FOLL_FAST_ONLY)) {
/* Try to get the remaining pages with get_user_pages */
start += nr_pinned << PAGE_SHIFT;
pages += nr_pinned;
@@ -2750,45 +2763,29 @@ static int internal_get_user_pages_fast(unsigned long 
start, int nr_pages,
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
  struct page **pages)
 {
-   unsigned long len, end;
-   unsigned long flags;
-   int nr_pinned = 0;
+   int nr_pinned;
/*
 * Internally (within mm/gup.c), gup fast variants must set FOLL_GET,
 * because gup fast is always a "pin with a +1 page refcount" request.
+*
+* FOLL_FAST_ONLY is required in order to match the API description of
+* this routine: no fall back to regular ("slow") GUP.
 */
-   unsigned int gup_flags = FOLL_GET;
+   unsigned int gup_flags = FOLL_GET | FOLL_FAST_ONLY;
 
if (write)
gup_flags |= FOLL_WRITE;
 
-   start = untagged_addr(start) & PAGE_MASK;
-   len = (unsigned long) nr_pages << PAGE_SHIFT;
-   end = start + len;
-
-   if (end <= start)
-   return 0;
-   if (unlikely(!access_ok((void __user *)start, len)))
-   return 0;

[PATCH v2 0/4] mm/gup, drm/i915: refactor gup_fast, convert to pin_user_pages()

2020-05-21 Thread John Hubbard
The purpose of posting this series is to launch a test in the
intel-gfx-ci tree. (The patches have already been merged into Andrew's
linux-mm tree.)

This applies to today's linux.git (note the base-commit tag at the
bottom).

Changes since V1:

* Fixed a bug in the refactoring patch: added FOLL_FAST_ONLY to the
  list of gup_flags *not* to WARN() on. This led to a failure in the
  first intel-gfx-ci test run [1].

[1] 
https://lore.kernel.org/r/159008745422.32320.5724805750977048...@build.alporthouse.com

Original cover letter:

This needs to go through Andrew's -mm tree, due to adding a new gup.c
routine. However, I would really love to have some testing from the
drm/i915 folks, because I haven't been able to run-time test that part
of it.

Otherwise, though, the series has passed my basic run time testing:
some LTP tests, some xfs and ext4 non-destructive xfstests, and an
assortment of other smaller ones: vm selftests, io_uring_register, a
few more. But that's only on one particular machine. Also, cross-compile
tests for half a dozen arches all pass.

Details:

In order to convert the drm/i915 driver from get_user_pages() to
pin_user_pages(), a FOLL_PIN equivalent of __get_user_pages_fast() was
required. That led to refactoring __get_user_pages_fast(), with the
following goals:

1) As above: provide a pin_user_pages*() routine for drm/i915 to call,
   in place of __get_user_pages_fast(),

2) Get rid of the gup.c duplicate code for walking page tables with
   interrupts disabled. This duplicate code is a minor maintenance
   problem anyway.

3) Make it easy for an upcoming patch from Souptick, which aims to
   convert __get_user_pages_fast() to use a gup_flags argument, instead
   of a bool writeable arg.  Also, if this series looks good, we can
   ask Souptick to change the name as well, to whatever the consensus
   is. My initial recommendation is: get_user_pages_fast_only(), to
   match the new pin_user_pages_fast_only().

John Hubbard (4):
  mm/gup: move __get_user_pages_fast() down a few lines in gup.c
  mm/gup: refactor and de-duplicate gup_fast() code
  mm/gup: introduce pin_user_pages_fast_only()
  drm/i915: convert get_user_pages() --> pin_user_pages()

 drivers/gpu/drm/i915/gem/i915_gem_userptr.c |  22 +--
 include/linux/mm.h  |   3 +
 mm/gup.c| 153 
 3 files changed, 109 insertions(+), 69 deletions(-)


base-commit: 051143e1602d90ea71887d92363edd539d411de5
-- 
2.26.2



[PATCH v2 3/4] mm/gup: introduce pin_user_pages_fast_only()

2020-05-21 Thread John Hubbard
This is the FOLL_PIN equivalent of __get_user_pages_fast(),
except with a more descriptive name, and gup_flags instead of
a boolean "write" in the argument list.
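
To illustrate the intended calling pattern, here is a hedged sketch of
a caller that must not fall back to the sleeping slow path (the helper
name and surrounding logic are hypothetical, not part of this patch):

#include <linux/mm.h>

static bool example_try_fast_pin(unsigned long uaddr, int nr_pages,
                                 struct page **pages)
{
        /* Opportunistic pin: never falls back to slow ("regular") GUP. */
        int pinned = pin_user_pages_fast_only(uaddr, nr_pages, FOLL_WRITE,
                                              pages);

        if (pinned < nr_pages) {
                /* Release any partial pin; the caller can retry from a
                 * context that is allowed to sleep and fault pages in.
                 */
                unpin_user_pages(pages, pinned);
                return false;
        }
        return true;
}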

Signed-off-by: John Hubbard 
---
 include/linux/mm.h |  2 ++
 mm/gup.c   | 36 
 2 files changed, 38 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 84b601cab699..98be7289d7e9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1820,6 +1820,8 @@ extern int mprotect_fixup(struct vm_area_struct *vma,
  */
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
  struct page **pages);
+int pin_user_pages_fast_only(unsigned long start, int nr_pages,
+unsigned int gup_flags, struct page **pages);
 /*
  * per-process(per-mm_struct) statistics.
  */
diff --git a/mm/gup.c b/mm/gup.c
index 4564b0dc7d0b..6fa9b2016a53 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2859,6 +2859,42 @@ int pin_user_pages_fast(unsigned long start, int nr_pages,
 }
 EXPORT_SYMBOL_GPL(pin_user_pages_fast);
 
+/*
+ * This is the FOLL_PIN equivalent of __get_user_pages_fast(). Behavior is the
+ * same, except that this one sets FOLL_PIN instead of FOLL_GET.
+ *
+ * The API rules are the same, too: no negative values may be returned.
+ */
+int pin_user_pages_fast_only(unsigned long start, int nr_pages,
+unsigned int gup_flags, struct page **pages)
+{
+   int nr_pinned;
+
+   /*
+* FOLL_GET and FOLL_PIN are mutually exclusive. Note that the API
+* rules require returning 0, rather than -errno:
+*/
+   if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+   return 0;
+   /*
+* FOLL_FAST_ONLY is required in order to match the API description of
+* this routine: no fall back to regular ("slow") GUP.
+*/
+   gup_flags |= (FOLL_PIN | FOLL_FAST_ONLY);
+   nr_pinned = internal_get_user_pages_fast(start, nr_pages, gup_flags,
+pages);
+   /*
+* This routine is not allowed to return negative values. However,
+* internal_get_user_pages_fast() *can* return -errno. Therefore,
+* correct for that here:
+*/
+   if (nr_pinned < 0)
+   nr_pinned = 0;
+
+   return nr_pinned;
+}
+EXPORT_SYMBOL_GPL(pin_user_pages_fast_only);
+
 /**
  * pin_user_pages_remote() - pin pages of a remote process (task != current)
  *
-- 
2.26.2



[PATCH v2 1/4] mm/gup: move __get_user_pages_fast() down a few lines in gup.c

2020-05-21 Thread John Hubbard
This is in order to avoid a forward declaration of
internal_get_user_pages_fast(), in the next patch.

This is code movement only--all generated code should
be identical.

Signed-off-by: John Hubbard 
---
 mm/gup.c | 112 +++
 1 file changed, 56 insertions(+), 56 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 50cd9323efff..4502846d57f9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2666,62 +2666,6 @@ static bool gup_fast_permitted(unsigned long start, unsigned long end)
 }
 #endif
 
-/*
- * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
- * the regular GUP.
- * Note a difference with get_user_pages_fast: this always returns the
- * number of pages pinned, 0 if no pages were pinned.
- *
- * If the architecture does not support this function, simply return with no
- * pages pinned.
- */
-int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
- struct page **pages)
-{
-   unsigned long len, end;
-   unsigned long flags;
-   int nr_pinned = 0;
-   /*
-* Internally (within mm/gup.c), gup fast variants must set FOLL_GET,
-* because gup fast is always a "pin with a +1 page refcount" request.
-*/
-   unsigned int gup_flags = FOLL_GET;
-
-   if (write)
-   gup_flags |= FOLL_WRITE;
-
-   start = untagged_addr(start) & PAGE_MASK;
-   len = (unsigned long) nr_pages << PAGE_SHIFT;
-   end = start + len;
-
-   if (end <= start)
-   return 0;
-   if (unlikely(!access_ok((void __user *)start, len)))
-   return 0;
-
-   /*
-* Disable interrupts.  We use the nested form as we can already have
-* interrupts disabled by get_futex_key.
-*
-* With interrupts disabled, we block page table pages from being
-* freed from under us. See struct mmu_table_batch comments in
-* include/asm-generic/tlb.h for more details.
-*
-* We do not adopt an rcu_read_lock(.) here as we also want to
-* block IPIs that come from THPs splitting.
-*/
-
-   if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
-   gup_fast_permitted(start, end)) {
-   local_irq_save(flags);
-   gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
-   local_irq_restore(flags);
-   }
-
-   return nr_pinned;
-}
-EXPORT_SYMBOL_GPL(__get_user_pages_fast);
-
 static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
   unsigned int gup_flags, struct page **pages)
 {
@@ -2794,6 +2738,62 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
return ret;
 }
 
+/*
+ * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
+ * the regular GUP.
+ * Note a difference with get_user_pages_fast: this always returns the
+ * number of pages pinned, 0 if no pages were pinned.
+ *
+ * If the architecture does not support this function, simply return with no
+ * pages pinned.
+ */
+int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
+ struct page **pages)
+{
+   unsigned long len, end;
+   unsigned long flags;
+   int nr_pinned = 0;
+   /*
+* Internally (within mm/gup.c), gup fast variants must set FOLL_GET,
+* because gup fast is always a "pin with a +1 page refcount" request.
+*/
+   unsigned int gup_flags = FOLL_GET;
+
+   if (write)
+   gup_flags |= FOLL_WRITE;
+
+   start = untagged_addr(start) & PAGE_MASK;
+   len = (unsigned long) nr_pages << PAGE_SHIFT;
+   end = start + len;
+
+   if (end <= start)
+   return 0;
+   if (unlikely(!access_ok((void __user *)start, len)))
+   return 0;
+
+   /*
+* Disable interrupts.  We use the nested form as we can already have
+* interrupts disabled by get_futex_key.
+*
+* With interrupts disabled, we block page table pages from being
+* freed from under us. See struct mmu_table_batch comments in
+* include/asm-generic/tlb.h for more details.
+*
+* We do not adopt an rcu_read_lock(.) here as we also want to
+* block IPIs that come from THPs splitting.
+*/
+
+   if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
+   gup_fast_permitted(start, end)) {
+   local_irq_save(flags);
+   gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
+   local_irq_restore(flags);
+   }
+
+   return nr_pinned;
+}
+EXPORT_SYMBOL_GPL(__get_user_pages_fast);
+
 /**
  * get_user_pages_fast() - pin user pages in memory
  * @start:  starting user address
-- 
2.26.2



[PATCH 1/2] video: fbdev: fix error handling for get_user_pages_fast()

2020-05-21 Thread John Hubbard
Dealing with the return value of get_user_pages*() variants has a few
classic pitfalls, and this driver found one of them: the return value
might be zero, positive, or -errno. And if positive, it might be fewer
pages than were requested. And if fewer pages than requested, then
the caller should return (via put_page()) the pages that *were*
pinned.

This driver was doing that *except* that it had a problem with the
-errno case, which was being stored in an unsigned int, and which
would cause an interesting mess if it ever happened: nr_pages would be
interpreted as a spectacularly huge unsigned value, rather than a
small negative value. Also, it was unnecessarily overriding a
potentially informative -errno with -EINVAL in some cases.

Instead: clamp the nr_pages to zero or positive, so that the error
handling works. And return the -errno value from get_user_pages*(),
unchanged, if we get one. And explain this with comments, seeing as
how it is error-prone.
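
For reference, the calling pattern this implies, as a minimal sketch
(a hypothetical helper, not the driver's actual code):

#include <linux/mm.h>

static int example_gup(unsigned long uaddr, int nr_pages,
                       struct page **pages)
{
        int ret = get_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);

        if (ret < 0)
                return ret;     /* -errno: nothing was pinned; pass it up */

        if (ret < nr_pages) {
                /* Partial pin: release what *was* pinned, then fail. */
                while (ret--)
                        put_page(pages[ret]);
                return -EINVAL; /* a best-guess error, as described above */
        }
        return 0;               /* all nr_pages were pinned */
}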

Cc: Bartlomiej Zolnierkiewicz 
Cc: Arnd Bergmann 
Cc: Daniel Vetter 
Cc: Gustavo A. R. Silva 
Cc: Jani Nikula 
Cc: dri-devel@lists.freedesktop.org
Cc: linux-fb...@vger.kernel.org
Signed-off-by: John Hubbard 
---
 drivers/video/fbdev/pvr2fb.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/video/fbdev/pvr2fb.c b/drivers/video/fbdev/pvr2fb.c
index f18d457175d9..ceb6ef590597 100644
--- a/drivers/video/fbdev/pvr2fb.c
+++ b/drivers/video/fbdev/pvr2fb.c
@@ -654,8 +654,22 @@ static ssize_t pvr2fb_write(struct fb_info *info, const char *buf,
 
ret = get_user_pages_fast((unsigned long)buf, nr_pages, FOLL_WRITE, pages);
if (ret < nr_pages) {
-   nr_pages = ret;
-   ret = -EINVAL;
+   if (ret < 0) {
+   /*
+*  Clamp the unsigned nr_pages to zero so that the
+*  error handling works. And leave ret at whatever
+*  -errno value was returned from GUP.
+*/
+   nr_pages = 0;
+   } else {
+   nr_pages = ret;
+   /*
+* Use -EINVAL to represent a mildly desperate guess at
+* why we got fewer pages (maybe even zero pages) than
+* requested.
+*/
+   ret = -EINVAL;
+   }
goto out_unmap;
}
 
-- 
2.26.2



[PATCH 0/2] video: fbdev: fix error handling, convert to pin_user_pages*()

2020-05-21 Thread John Hubbard
Hi,

Note that I have only compile-tested this series, although that does
also include cross-compiling for a few other arches. I'm hoping that
this posting will lead to some run-time testing.

Also: the proposed fix does not have a "Fixes:" tag, nor does it
Cc stable. That's because the issue has been there since the dawn of
git history for the kernel. If it's gone unnoticed this long, then
there is clearly no need for the relatively fast track of putting it
into stable, IMHO. But please correct me if that's wrong.

Cc: Bartlomiej Zolnierkiewicz 
Cc: Arnd Bergmann 
Cc: Daniel Vetter 
Cc: Gustavo A. R. Silva 
Cc: Jani Nikula 
Cc: dri-devel@lists.freedesktop.org
Cc: linux-fb...@vger.kernel.org

John Hubbard (2):
  video: fbdev: fix error handling for get_user_pages_fast()
  video: fbdev: convert get_user_pages() --> pin_user_pages()

 drivers/video/fbdev/pvr2fb.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)


base-commit: 051143e1602d90ea71887d92363edd539d411de5
-- 
2.26.2



[PATCH 2/2] video: fbdev: convert get_user_pages() --> pin_user_pages()

2020-05-21 Thread John Hubbard
This code was using get_user_pages*(), in a "Case 2" scenario
(DMA/RDMA), using the categorization from [1]. That means that it's
time to convert the get_user_pages*() + put_page() calls to
pin_user_pages*() + unpin_user_pages() calls.

There is some helpful background in [2]: basically, this is a small
part of fixing a long-standing disconnect between pinning pages, and
file systems' use of those pages.

[1] Documentation/core-api/pin_user_pages.rst

[2] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/
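
A hedged sketch of the conversion pattern this series applies, using a
hypothetical DMA helper rather than the driver's actual code:

#include <linux/mm.h>

static int example_dma_case(unsigned long uaddr, int nr_pages,
                            struct page **pages)
{
        /* Was: get_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages); */
        int ret = pin_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);

        if (ret < 0)
                return ret;

        /* ... program the device to DMA into the pinned pages ... */

        /* Was: for (i = 0; i < ret; i++) put_page(pages[i]); */
        unpin_user_pages(pages, ret);
        return 0;
}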

Cc: Bartlomiej Zolnierkiewicz 
Cc: Arnd Bergmann 
Cc: Daniel Vetter 
Cc: Gustavo A. R. Silva 
Cc: Jani Nikula 
Cc: dri-devel@lists.freedesktop.org
Cc: linux-fb...@vger.kernel.org
Signed-off-by: John Hubbard 
---
 drivers/video/fbdev/pvr2fb.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/video/fbdev/pvr2fb.c b/drivers/video/fbdev/pvr2fb.c
index ceb6ef590597..2d9f69b93392 100644
--- a/drivers/video/fbdev/pvr2fb.c
+++ b/drivers/video/fbdev/pvr2fb.c
@@ -652,7 +652,7 @@ static ssize_t pvr2fb_write(struct fb_info *info, const char *buf,
if (!pages)
return -ENOMEM;
 
-   ret = get_user_pages_fast((unsigned long)buf, nr_pages, FOLL_WRITE, pages);
+   ret = pin_user_pages_fast((unsigned long)buf, nr_pages, FOLL_WRITE, pages);
if (ret < nr_pages) {
if (ret < 0) {
/*
@@ -712,9 +712,7 @@ static ssize_t pvr2fb_write(struct fb_info *info, const char *buf,
ret = count;
 
 out_unmap:
-   for (i = 0; i < nr_pages; i++)
-   put_page(pages[i]);
-
+   unpin_user_pages(pages, nr_pages);
kfree(pages);
 
return ret;
-- 
2.26.2



Re: [PATCH] mm/gup: fixup gup.c for "mm/gup: refactor and de-duplicate gup_fast() code"

2020-05-21 Thread John Hubbard

On 2020-05-21 19:46, Chris Wilson wrote:

Quoting John Hubbard (2020-05-22 00:38:41)

Include FOLL_FAST_ONLY in the list of flags to *not* WARN()
on, in internal_get_user_pages_fast().

Cc: Chris Wilson 
Cc: Daniel Vetter 
Cc: David Airlie 
Cc: Jani Nikula 
Cc: "Joonas Lahtinen" 
Cc: Matthew Auld 
Cc: Matthew Wilcox 
Cc: Rodrigo Vivi 
Cc: Souptick Joarder 
Cc: Tvrtko Ursulin 
Signed-off-by: John Hubbard 
---

Hi Andrew, Chris,

Andrew: This is a fixup that applies to today's (20200521) linux-next.
In that tree, this fixes up:

commit dfb8dfe80808 ("mm/gup: refactor and de-duplicate gup_fast() code")

Chris: I'd like to request another CI run for the drm/i915 changes, so
for that, would you prefer that I post a v2 of the series [1], or
is it easier for you to just apply this patch here, on top of [2]?


If you post your series again with this patch included to intel-gfx, CI
will pick it up. Or I'll do that in the morning.
-Chris



OK, perfect. I'll post a version for linux.git in a moment here.


thanks,
--
John Hubbard
NVIDIA


[PATCH] mm/gup: fixup gup.c for "mm/gup: refactor and de-duplicate gup_fast() code"

2020-05-21 Thread John Hubbard
Include FOLL_FAST_ONLY in the list of flags to *not* WARN()
on, in internal_get_user_pages_fast().

Cc: Chris Wilson 
Cc: Daniel Vetter 
Cc: David Airlie 
Cc: Jani Nikula 
Cc: "Joonas Lahtinen" 
Cc: Matthew Auld 
Cc: Matthew Wilcox 
Cc: Rodrigo Vivi 
Cc: Souptick Joarder 
Cc: Tvrtko Ursulin 
Signed-off-by: John Hubbard 
---

Hi Andrew, Chris,

Andrew: This is a fixup that applies to today's (20200521) linux-next.
In that tree, this fixes up:

commit dfb8dfe80808 ("mm/gup: refactor and de-duplicate gup_fast() code")

Chris: I'd like to request another CI run for the drm/i915 changes, so
for that, would you prefer that I post a v2 of the series [1], or
is it easier for you to just apply this patch here, on top of [2]?

[1] https://lore.kernel.org/r/20200519002124.2025955-1-jhubb...@nvidia.com

[2] https://lore.kernel.org/r/158985123351.31239.10766458886430429...@emeril.freedesktop.org

thanks,
John Hubbard
NVIDIA

 mm/gup.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index dd8895f2fafa1..ada6aa79576dc 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2724,7 +2724,8 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
int nr_pinned = 0, ret = 0;
 
if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
-  FOLL_FORCE | FOLL_PIN | FOLL_GET)))
+  FOLL_FORCE | FOLL_PIN | FOLL_GET |
+  FOLL_FAST_ONLY)))
return -EINVAL;
 
start = untagged_addr(start) & PAGE_MASK;
-- 
2.26.2



Solved: [PATCH 0/4] mm/gup, drm/i915: refactor gup_fast, convert to pin_user_pages()

2020-05-21 Thread John Hubbard

On 2020-05-21 12:11, John Hubbard wrote:

On 2020-05-21 11:57, Chris Wilson wrote:

Quoting John Hubbard (2020-05-19 01:21:20)

This needs to go through Andrew's -mm tree, due to adding a new gup.c
routine. However, I would really love to have some testing from the
drm/i915 folks, because I haven't been able to run-time test that part
of it.


CI hit

<4> [185.667750] WARNING: CPU: 0 PID: 1387 at mm/gup.c:2699 internal_get_user_pages_fast+0x63a/0xac0



OK, what happened here is that it's WARN()'ing due to passing in the new
FOLL_FAST_ONLY flag, which was not added to the whitelist.

So the fix is easy, and should be applied to the refactoring patch. I'll
send out a v2 of the series, which will effectively have this applied:


diff --git a/mm/gup.c b/mm/gup.c
index 6cbe98c93466..4f0ca3f849d1 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2696,7 +2696,8 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,

int nr_pinned = 0, ret = 0;

if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
-  FOLL_FORCE | FOLL_PIN | FOLL_GET)))
+  FOLL_FORCE | FOLL_PIN | FOLL_GET |
+  FOLL_FAST_ONLY)))
return -EINVAL;

start = untagged_addr(start) & PAGE_MASK;


<4> [185.667752] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek 
snd_hda_codec_generic i915 mei_hdcp x86_pkg_temp_thermal coretemp snd_hda_intel 
snd_intel_dspcfg crct10dif_pclmul snd_hda_codec crc32_pclmul snd_hwdep snd_hda_core 
ghash_clmulni_intel cdc_ether usbnet mii snd_pcm e1000e mei_me ptp pps_core mei 
intel_lpss_pci prime_numbers
<4> [185.667774] CPU: 0 PID: 1387 Comm: gem_userptr_bli Tainted: G U
5.7.0-rc5-CI-Patchwork_17704+ #1
<4> [185.66] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake 
U DDR4 SODIMM PD RVP, BIOS ICLSFWR1.R00.3234.A01.1906141750 06/14/2019

<4> [185.667782] RIP: 0010:internal_get_user_pages_fast+0x63a/0xac0
<4> [185.667785] Code: 24 40 08 48 39 5c 24 38 49 89 df 0f 85 74 fc ff ff 48 83 44 
24 50 08 48 39 5c 24 58 49 89 dc 0f 85 e0 fb ff ff e9 14 fe ff ff <0f> 0b b8 ea ff 
ff ff e9 36 fb ff ff 4c 89 e8 48 21 e8 48 39 e8 0f

<4> [185.667789] RSP: 0018:c90001133c38 EFLAGS: 00010206
<4> [185.667792] RAX:  RBX:  RCX: 
8884999ee800
<4> [185.667795] RDX: 000c0001 RSI: 0100 RDI: 
7f419e774000
<4> [185.667798] RBP: 888453dbf040 R08:  R09: 
0001
<4> [185.667800] R10:  R11:  R12: 
888453dbf380
<4> [185.667803] R13: 8884999ee800 R14: 888453dbf3e8 R15: 
0040
<4> [185.667806] FS:  7f419e875e40() GS:88849fe0() 
knlGS:

<4> [185.667808] CS:  0010 DS:  ES:  CR0: 80050033
<4> [185.667811] CR2: 7f419e873000 CR3: 000458bd2004 CR4: 
00760ef0
<4> [185.667814] PKRU: 5554
<4> [185.667816] Call Trace:
<4> [185.667912]  ? i915_gem_userptr_get_pages+0x1c6/0x290 [i915]
<4> [185.667918]  ? mark_held_locks+0x49/0x70
<4> [185.667998]  ? i915_gem_userptr_get_pages+0x1c6/0x290 [i915]
<4> [185.668073]  ? i915_gem_userptr_get_pages+0x1c6/0x290 [i915]

and then panicked, across a range of systems.
-Chris



btw, the panic seems to indicate an additional, pre-existing problem:
i915_gem_userptr_get_pages(), in this case at least, is not able to
recover from a get_user_pages/pin_user_pages failure.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 0/4] mm/gup, drm/i915: refactor gup_fast, convert to pin_user_pages()

2020-05-21 Thread John Hubbard

On 2020-05-21 11:57, Chris Wilson wrote:

Quoting John Hubbard (2020-05-19 01:21:20)

This needs to go through Andrew's -mm tree, due to adding a new gup.c
routine. However, I would really love to have some testing from the
drm/i915 folks, because I haven't been able to run-time test that part
of it.


CI hit

<4> [185.667750] WARNING: CPU: 0 PID: 1387 at mm/gup.c:2699 internal_get_user_pages_fast+0x63a/0xac0
<4> [185.667752] Modules linked in: vgem snd_hda_codec_hdmi 
snd_hda_codec_realtek snd_hda_codec_generic i915 mei_hdcp x86_pkg_temp_thermal 
coretemp snd_hda_intel snd_intel_dspcfg crct10dif_pclmul snd_hda_codec crc32_pclmul 
snd_hwdep snd_hda_core ghash_clmulni_intel cdc_ether usbnet mii snd_pcm e1000e mei_me 
ptp pps_core mei intel_lpss_pci prime_numbers
<4> [185.667774] CPU: 0 PID: 1387 Comm: gem_userptr_bli Tainted: G U
5.7.0-rc5-CI-Patchwork_17704+ #1
<4> [185.66] Hardware name: Intel Corporation Ice Lake Client 
Platform/IceLake U DDR4 SODIMM PD RVP, BIOS ICLSFWR1.R00.3234.A01.1906141750 
06/14/2019
<4> [185.667782] RIP: 0010:internal_get_user_pages_fast+0x63a/0xac0
<4> [185.667785] Code: 24 40 08 48 39 5c 24 38 49 89 df 0f 85 74 fc ff ff 48 83 44 24 
50 08 48 39 5c 24 58 49 89 dc 0f 85 e0 fb ff ff e9 14 fe ff ff <0f> 0b b8 ea ff ff ff 
e9 36 fb ff ff 4c 89 e8 48 21 e8 48 39 e8 0f
<4> [185.667789] RSP: 0018:c90001133c38 EFLAGS: 00010206
<4> [185.667792] RAX:  RBX:  RCX: 
8884999ee800
<4> [185.667795] RDX: 000c0001 RSI: 0100 RDI: 
7f419e774000
<4> [185.667798] RBP: 888453dbf040 R08:  R09: 
0001
<4> [185.667800] R10:  R11:  R12: 
888453dbf380
<4> [185.667803] R13: 8884999ee800 R14: 888453dbf3e8 R15: 
0040
<4> [185.667806] FS:  7f419e875e40() GS:88849fe0() 
knlGS:
<4> [185.667808] CS:  0010 DS:  ES:  CR0: 80050033
<4> [185.667811] CR2: 7f419e873000 CR3: 000458bd2004 CR4: 
00760ef0
<4> [185.667814] PKRU: 5554
<4> [185.667816] Call Trace:
<4> [185.667912]  ? i915_gem_userptr_get_pages+0x1c6/0x290 [i915]
<4> [185.667918]  ? mark_held_locks+0x49/0x70
<4> [185.667998]  ? i915_gem_userptr_get_pages+0x1c6/0x290 [i915]
<4> [185.668073]  ? i915_gem_userptr_get_pages+0x1c6/0x290 [i915]

and then panicked, across a range of systems.
-Chris



Thanks for this report! I'm looking into it now.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] tee: convert convert get_user_pages() --> pin_user_pages()

2020-05-19 Thread John Hubbard

On 2020-05-18 22:18, John Hubbard wrote:

This code was using get_user_pages*(), in a "Case 2" scenario
(DMA/RDMA), using the categorization from [1]. That means that it's
time to convert the get_user_pages*() + put_page() calls to
pin_user_pages*() + unpin_user_pages() calls.



Looks like I accidentally doubled a word on the subject line:
"convert convert". :)

I'd appreciate it if a maintainer could remove one of those for
me, while applying the patch, assuming that we don't need a
v2 for other reasons.


thanks,
--
John Hubbard
NVIDIA


[PATCH] tee: convert convert get_user_pages() --> pin_user_pages()

2020-05-18 Thread John Hubbard
This code was using get_user_pages*(), in a "Case 2" scenario
(DMA/RDMA), using the categorization from [1]. That means that it's
time to convert the get_user_pages*() + put_page() calls to
pin_user_pages*() + unpin_user_pages() calls.

There is some helpful background in [2]: basically, this is a small
part of fixing a long-standing disconnect between pinning pages, and
file systems' use of those pages.

[1] Documentation/core-api/pin_user_pages.rst

[2] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/

Cc: Jens Wiklander 
Cc: Sumit Semwal 
Cc: tee-...@lists.linaro.org
Cc: linux-me...@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-...@lists.linaro.org
Signed-off-by: John Hubbard 
---

Note that I have only compile-tested this patch, although that does
also include cross-compiling for a few other arches.

thanks,
John Hubbard
NVIDIA

 drivers/tee/tee_shm.c | 12 +++-
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/drivers/tee/tee_shm.c b/drivers/tee/tee_shm.c
index bd679b72bd05..7dffc42d8d5a 100644
--- a/drivers/tee/tee_shm.c
+++ b/drivers/tee/tee_shm.c
@@ -31,16 +31,13 @@ static void tee_shm_release(struct tee_shm *shm)
 
poolm->ops->free(poolm, shm);
} else if (shm->flags & TEE_SHM_REGISTER) {
-   size_t n;
int rc = teedev->desc->ops->shm_unregister(shm->ctx, shm);
 
if (rc)
dev_err(teedev->dev.parent,
"unregister shm %p failed: %d", shm, rc);
 
-   for (n = 0; n < shm->num_pages; n++)
-   put_page(shm->pages[n]);
-
+   unpin_user_pages(shm->pages, shm->num_pages);
kfree(shm->pages);
}
 
@@ -226,7 +223,7 @@ struct tee_shm *tee_shm_register(struct tee_context *ctx, unsigned long addr,
goto err;
}
 
-   rc = get_user_pages_fast(start, num_pages, FOLL_WRITE, shm->pages);
+   rc = pin_user_pages_fast(start, num_pages, FOLL_WRITE, shm->pages);
if (rc > 0)
shm->num_pages = rc;
if (rc != num_pages) {
@@ -271,16 +268,13 @@ struct tee_shm *tee_shm_register(struct tee_context *ctx, unsigned long addr,
return shm;
 err:
if (shm) {
-   size_t n;
-
if (shm->id >= 0) {
mutex_lock(&teedev->mutex);
idr_remove(&teedev->idr, shm->id);
mutex_unlock(&teedev->mutex);
}
if (shm->pages) {
-   for (n = 0; n < shm->num_pages; n++)
-   put_page(shm->pages[n]);
+   unpin_user_pages(shm->pages, shm->num_pages);
kfree(shm->pages);
}
}
-- 
2.26.2



[PATCH 3/4] mm/gup: introduce pin_user_pages_fast_only()

2020-05-18 Thread John Hubbard
This is the FOLL_PIN equivalent of __get_user_pages_fast(),
except with a more descriptive name, and gup_flags instead of
a boolean "write" in the argument list.

Signed-off-by: John Hubbard 
---
 include/linux/mm.h |  2 ++
 mm/gup.c   | 36 
 2 files changed, 38 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 84b601cab699..98be7289d7e9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1820,6 +1820,8 @@ extern int mprotect_fixup(struct vm_area_struct *vma,
  */
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
  struct page **pages);
+int pin_user_pages_fast_only(unsigned long start, int nr_pages,
+unsigned int gup_flags, struct page **pages);
 /*
  * per-process(per-mm_struct) statistics.
  */
diff --git a/mm/gup.c b/mm/gup.c
index bb3e2c4288c3..4413f0f94b68 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2858,6 +2858,42 @@ int pin_user_pages_fast(unsigned long start, int nr_pages,
 }
 EXPORT_SYMBOL_GPL(pin_user_pages_fast);
 
+/*
+ * This is the FOLL_PIN equivalent of __get_user_pages_fast(). Behavior is the
+ * same, except that this one sets FOLL_PIN instead of FOLL_GET.
+ *
+ * The API rules are the same, too: no negative values may be returned.
+ */
+int pin_user_pages_fast_only(unsigned long start, int nr_pages,
+unsigned int gup_flags, struct page **pages)
+{
+   int nr_pinned;
+
+   /*
+* FOLL_GET and FOLL_PIN are mutually exclusive. Note that the API
+* rules require returning 0, rather than -errno:
+*/
+   if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+   return 0;
+   /*
+* FOLL_FAST_ONLY is required in order to match the API description of
+* this routine: no fall back to regular ("slow") GUP.
+*/
+   gup_flags |= (FOLL_PIN | FOLL_FAST_ONLY);
+   nr_pinned = internal_get_user_pages_fast(start, nr_pages, gup_flags,
+pages);
+   /*
+* This routine is not allowed to return negative values. However,
+* internal_get_user_pages_fast() *can* return -errno. Therefore,
+* correct for that here:
+*/
+   if (nr_pinned < 0)
+   nr_pinned = 0;
+
+   return nr_pinned;
+}
+EXPORT_SYMBOL_GPL(pin_user_pages_fast_only);
+
 /**
  * pin_user_pages_remote() - pin pages of a remote process (task != current)
  *
-- 
2.26.2



[PATCH 1/4] mm/gup: move __get_user_pages_fast() down a few lines in gup.c

2020-05-18 Thread John Hubbard
This is in order to avoid a forward declaration of
internal_get_user_pages_fast(), in the next patch.

This is code movement only--all generated code should
be identical.

Signed-off-by: John Hubbard 
---
 mm/gup.c | 112 +++
 1 file changed, 56 insertions(+), 56 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 50cd9323efff..4502846d57f9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2666,62 +2666,6 @@ static bool gup_fast_permitted(unsigned long start, unsigned long end)
 }
 #endif
 
-/*
- * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
- * the regular GUP.
- * Note a difference with get_user_pages_fast: this always returns the
- * number of pages pinned, 0 if no pages were pinned.
- *
- * If the architecture does not support this function, simply return with no
- * pages pinned.
- */
-int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
- struct page **pages)
-{
-   unsigned long len, end;
-   unsigned long flags;
-   int nr_pinned = 0;
-   /*
-* Internally (within mm/gup.c), gup fast variants must set FOLL_GET,
-* because gup fast is always a "pin with a +1 page refcount" request.
-*/
-   unsigned int gup_flags = FOLL_GET;
-
-   if (write)
-   gup_flags |= FOLL_WRITE;
-
-   start = untagged_addr(start) & PAGE_MASK;
-   len = (unsigned long) nr_pages << PAGE_SHIFT;
-   end = start + len;
-
-   if (end <= start)
-   return 0;
-   if (unlikely(!access_ok((void __user *)start, len)))
-   return 0;
-
-   /*
-* Disable interrupts.  We use the nested form as we can already have
-* interrupts disabled by get_futex_key.
-*
-* With interrupts disabled, we block page table pages from being
-* freed from under us. See struct mmu_table_batch comments in
-* include/asm-generic/tlb.h for more details.
-*
-* We do not adopt an rcu_read_lock(.) here as we also want to
-* block IPIs that come from THPs splitting.
-*/
-
-   if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
-   gup_fast_permitted(start, end)) {
-   local_irq_save(flags);
-   gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
-   local_irq_restore(flags);
-   }
-
-   return nr_pinned;
-}
-EXPORT_SYMBOL_GPL(__get_user_pages_fast);
-
 static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
   unsigned int gup_flags, struct page **pages)
 {
@@ -2794,6 +2738,62 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
return ret;
 }
 
+/*
+ * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
+ * the regular GUP.
+ * Note a difference with get_user_pages_fast: this always returns the
+ * number of pages pinned, 0 if no pages were pinned.
+ *
+ * If the architecture does not support this function, simply return with no
+ * pages pinned.
+ */
+int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
+ struct page **pages)
+{
+   unsigned long len, end;
+   unsigned long flags;
+   int nr_pinned = 0;
+   /*
+* Internally (within mm/gup.c), gup fast variants must set FOLL_GET,
+* because gup fast is always a "pin with a +1 page refcount" request.
+*/
+   unsigned int gup_flags = FOLL_GET;
+
+   if (write)
+   gup_flags |= FOLL_WRITE;
+
+   start = untagged_addr(start) & PAGE_MASK;
+   len = (unsigned long) nr_pages << PAGE_SHIFT;
+   end = start + len;
+
+   if (end <= start)
+   return 0;
+   if (unlikely(!access_ok((void __user *)start, len)))
+   return 0;
+
+   /*
+* Disable interrupts.  We use the nested form as we can already have
+* interrupts disabled by get_futex_key.
+*
+* With interrupts disabled, we block page table pages from being
+* freed from under us. See struct mmu_table_batch comments in
+* include/asm-generic/tlb.h for more details.
+*
+* We do not adopt an rcu_read_lock(.) here as we also want to
+* block IPIs that come from THPs splitting.
+*/
+
+   if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
+   gup_fast_permitted(start, end)) {
+   local_irq_save(flags);
+   gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
+   local_irq_restore(flags);
+   }
+
+   return nr_pinned;
+}
+EXPORT_SYMBOL_GPL(__get_user_pages_fast);
+
 /**
  * get_user_pages_fast() - pin user pages in memory
  * @start:  starting user address
-- 
2.26.2



[PATCH 0/4] mm/gup, drm/i915: refactor gup_fast, convert to pin_user_pages()

2020-05-18 Thread John Hubbard
This needs to go through Andrew's -mm tree, due to adding a new gup.c
routine. However, I would really love to have some testing from the
drm/i915 folks, because I haven't been able to run-time test that part
of it.

Otherwise, though, the series has passed my basic run time testing:
some LTP tests, some xfs and ext4 non-destructive xfstests, and an
assortment of other smaller ones: vm selftests, io_uring_register, a
few more. But that's only on one particular machine. Also, cross-compile
tests for half a dozen arches all pass.

Details:

In order to convert the drm/i915 driver from get_user_pages() to
pin_user_pages(), a FOLL_PIN equivalent of __get_user_pages_fast() was
required. That led to refactoring __get_user_pages_fast(), with the
following goals:

1) As above: provide a pin_user_pages*() routine for drm/i915 to call,
   in place of __get_user_pages_fast(),

2) Get rid of the gup.c duplicate code for walking page tables with
   interrupts disabled. This duplicate code is a minor maintenance
   problem anyway.

3) Make it easy for an upcoming patch from Souptick, which aims to
   convert __get_user_pages_fast() to use a gup_flags argument, instead
   of a bool writeable arg.  Also, if this series looks good, we can
   ask Souptick to change the name as well, to whatever the consensus
   is. My initial recommendation is: get_user_pages_fast_only(), to
   match the new pin_user_pages_fast_only().

John Hubbard (4):
  mm/gup: move __get_user_pages_fast() down a few lines in gup.c
  mm/gup: refactor and de-duplicate gup_fast() code
  mm/gup: introduce pin_user_pages_fast_only()
  drm/i915: convert get_user_pages() --> pin_user_pages()

 drivers/gpu/drm/i915/gem/i915_gem_userptr.c |  22 +--
 include/linux/mm.h  |   3 +
 mm/gup.c| 150 
 3 files changed, 107 insertions(+), 68 deletions(-)


base-commit: 642b151f45dd54809ea00ecd3976a56c1ec9b53d
-- 
2.26.2



[PATCH 4/4] drm/i915: convert get_user_pages() --> pin_user_pages()

2020-05-18 Thread John Hubbard
This code was using get_user_pages*(), in a "Case 2" scenario
(DMA/RDMA), using the categorization from [1]. That means that it's
time to convert the get_user_pages*() + put_page() calls to
pin_user_pages*() + unpin_user_pages() calls.

There is some helpful background in [2]: basically, this is a small
part of fixing a long-standing disconnect between pinning pages, and
file systems' use of those pages.

[1] Documentation/core-api/pin_user_pages.rst

[2] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/

Signed-off-by: John Hubbard 
---
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 22 -
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
index 7ffd7afeb7a5..b55ac7563189 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -471,7 +471,7 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
down_read(&mm->mmap_sem);
locked = 1;
}
-   ret = get_user_pages_remote
+   ret = pin_user_pages_remote
(work->task, mm,
 obj->userptr.ptr + pinned * PAGE_SIZE,
 npages - pinned,
@@ -507,7 +507,7 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
}
mutex_unlock(&obj->mm.lock);
 
-   release_pages(pvec, pinned);
+   unpin_user_pages(pvec, pinned);
kvfree(pvec);
 
i915_gem_object_put(obj);
@@ -564,6 +564,7 @@ static int i915_gem_userptr_get_pages(struct drm_i915_gem_object *obj)
struct sg_table *pages;
bool active;
int pinned;
+   unsigned int gup_flags = 0;
 
/* If userspace should engineer that these pages are replaced in
 * the vma between us binding this page into the GTT and completion
@@ -598,11 +599,14 @@ static int i915_gem_userptr_get_pages(struct drm_i915_gem_object *obj)
  GFP_KERNEL |
  __GFP_NORETRY |
  __GFP_NOWARN);
-   if (pvec) /* defer to worker if malloc fails */
-   pinned = __get_user_pages_fast(obj->userptr.ptr,
-  num_pages,
-  !i915_gem_object_is_readonly(obj),
-  pvec);
+   /* defer to worker if malloc fails */
+   if (pvec) {
+   if (!i915_gem_object_is_readonly(obj))
+   gup_flags |= FOLL_WRITE;
+   pinned = pin_user_pages_fast_only(obj->userptr.ptr,
+ num_pages, gup_flags,
+ pvec);
+   }
}
 
active = false;
@@ -620,7 +624,7 @@ static int i915_gem_userptr_get_pages(struct drm_i915_gem_object *obj)
__i915_gem_userptr_set_active(obj, true);
 
if (IS_ERR(pages))
-   release_pages(pvec, pinned);
+   unpin_user_pages(pvec, pinned);
kvfree(pvec);
 
return PTR_ERR_OR_ZERO(pages);
@@ -675,7 +679,7 @@ i915_gem_userptr_put_pages(struct drm_i915_gem_object *obj,
}
 
mark_page_accessed(page);
-   put_page(page);
+   unpin_user_page(page);
}
obj->mm.dirty = false;
 
-- 
2.26.2



[PATCH 2/4] mm/gup: refactor and de-duplicate gup_fast() code

2020-05-18 Thread John Hubbard
There were two nearly identical sets of code for gup_fast()
style of walking the page tables with interrupts disabled.
This has led to the usual maintenance problems that arise from
having duplicated code.

There is already a core internal routine in gup.c for gup_fast(),
so just enhance it very slightly: allow skipping the fall-back
to "slow" (regular) get_user_pages(), via the new FOLL_FAST_ONLY
flag. Then, just call internal_get_user_pages_fast() from
__get_user_pages_fast(), and adjust the API to match pre-existing
API behavior.

There is a change in behavior from this refactoring: the nested
form of interrupt disabling is used in all gup_fast() variants
now. That's because there is only one place that interrupt disabling
for page walking is done, and so the safer form is required. This
should, if anything, eliminate possible (rare) bugs, because the
non-nested form of enabling interrupts was fragile at best.
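
A sketch of the difference, for illustration only (the page table walk
itself is elided):

#include <linux/irqflags.h>

/* Non-nested form: wrong for a caller that already has interrupts
 * disabled, because it unconditionally re-enables them on the way out.
 */
static void fragile_section(void)
{
        local_irq_disable();
        /* ... walk page tables ... */
        local_irq_enable();
}

/* Nested form: saves, then restores, whatever IRQ state the caller had. */
static void safe_section(void)
{
        unsigned long flags;

        local_irq_save(flags);
        /* ... walk page tables ... */
        local_irq_restore(flags);
}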

Signed-off-by: John Hubbard 
---
 include/linux/mm.h |  1 +
 mm/gup.c   | 60 ++
 2 files changed, 29 insertions(+), 32 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a5594ac9ebe3..84b601cab699 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2782,6 +2782,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_LONGTERM  0x10000 /* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
 #define FOLL_PIN   0x40000 /* pages must be released via unpin_user_page */
+#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index 4502846d57f9..bb3e2c4288c3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2694,6 +2694,7 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
struct page **pages)
 {
unsigned long addr, len, end;
+   unsigned long flags;
int nr_pinned = 0, ret = 0;
 
if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
@@ -2710,15 +2711,26 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
if (unlikely(!access_ok((void __user *)start, len)))
return -EFAULT;
 
+   /*
+* Disable interrupts. The nested form is used, in order to allow full,
+* general purpose use of this routine.
+*
+* With interrupts disabled, we block page table pages from being
+* freed from under us. See struct mmu_table_batch comments in
+* include/asm-generic/tlb.h for more details.
+*
+* We do not adopt an rcu_read_lock(.) here as we also want to
+* block IPIs that come from THPs splitting.
+*/
if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
gup_fast_permitted(start, end)) {
-   local_irq_disable();
+   local_irq_save(flags);
gup_pgd_range(addr, end, gup_flags, pages, &nr_pinned);
-   local_irq_enable();
+   local_irq_restore(flags);
ret = nr_pinned;
}
 
-   if (nr_pinned < nr_pages) {
+   if (nr_pinned < nr_pages && !(gup_flags & FOLL_FAST_ONLY)) {
/* Try to get the remaining pages with get_user_pages */
start += nr_pinned << PAGE_SHIFT;
pages += nr_pinned;
@@ -2750,45 +2762,29 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
  struct page **pages)
 {
-   unsigned long len, end;
-   unsigned long flags;
-   int nr_pinned = 0;
+   int nr_pinned;
/*
 * Internally (within mm/gup.c), gup fast variants must set FOLL_GET,
 * because gup fast is always a "pin with a +1 page refcount" request.
+*
+* FOLL_FAST_ONLY is required in order to match the API description of
+* this routine: no fall back to regular ("slow") GUP.
 */
-   unsigned int gup_flags = FOLL_GET;
+   unsigned int gup_flags = FOLL_GET | FOLL_FAST_ONLY;
 
if (write)
gup_flags |= FOLL_WRITE;
 
-   start = untagged_addr(start) & PAGE_MASK;
-   len = (unsigned long) nr_pages << PAGE_SHIFT;
-   end = start + len;
-
-   if (end <= start)
-   return 0;
-   if (unlikely(!access_ok((void __user *)start, len)))
-   return 0;
-
+   nr_pinned = internal_get_user_pages_fast(start, nr_pages, gup_flags,
+pages);
/*
-* Disable interrupts.  We use the nested form as we can already have
-* interrupts disabled by get_futex_key.
-*
-* With interrupts disabled, we block page table pages from being

[PATCH] drm/etnaviv: convert get_user_pages() --> pin_user_pages()

2020-05-17 Thread John Hubbard
This code was using get_user_pages*(), in a "Case 2" scenario
(DMA/RDMA), using the categorization from [1]. That means that it's
time to convert the get_user_pages*() + put_page() calls to
pin_user_pages*() + unpin_user_pages() calls.

There is some helpful background in [2]: basically, this is a small
part of fixing a long-standing disconnect between pinning pages, and
file systems' use of those pages.

[1] Documentation/core-api/pin_user_pages.rst

[2] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/

Signed-off-by: John Hubbard 
---

Hi,

Note that I have only compile-tested this patch, although that does
also include cross-compiling for a few other arches.

thanks,
John Hubbard
NVIDIA

 drivers/gpu/drm/etnaviv/etnaviv_gem.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem.c b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
index dc9ef302f517..0f4578dc169d 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_gem.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
@@ -675,10 +675,10 @@ static int etnaviv_gem_userptr_get_pages(struct etnaviv_gem_object *etnaviv_obj)
uint64_t ptr = userptr->ptr + pinned * PAGE_SIZE;
struct page **pages = pvec + pinned;
 
-   ret = get_user_pages_fast(ptr, num_pages,
+   ret = pin_user_pages_fast(ptr, num_pages,
  !userptr->ro ? FOLL_WRITE : 0, pages);
if (ret < 0) {
-   release_pages(pvec, pinned);
+   unpin_user_pages(pvec, pinned);
kvfree(pvec);
return ret;
}
@@ -702,7 +702,7 @@ static void etnaviv_gem_userptr_release(struct etnaviv_gem_object *etnaviv_obj)
if (etnaviv_obj->pages) {
int npages = etnaviv_obj->base.size >> PAGE_SHIFT;
 
-   release_pages(etnaviv_obj->pages, npages);
+   unpin_user_pages(etnaviv_obj->pages, npages);
kvfree(etnaviv_obj->pages);
}
 }
-- 
2.26.2



Re: [PATCH hmm v2 5/5] mm/hmm: remove the customizable pfn format from hmm_range_fault

2020-05-04 Thread John Hubbard
Seems like we could get stuck in a loop here,
if we're not issuing a new REQ, right?



if (ret == -EBUSY)
continue;
return ret;
@@ -562,7 +587,7 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
break;
}
  
-	nouveau_dmem_convert_pfn(drm, &range);

+   nouveau_hmm_convert_pfn(drm, &range, ioctl_addr);
  
  	svmm->vmm->vmm.object.client->super = true;

ret = nvif_object_ioctl(&svmm->vmm->vmm.object, data, size, NULL);
@@ -589,6 +614,7 @@ nouveau_svm_fault(struct nvif_notify *notify)
} i;
u64 phys[16];
} args;
+   unsigned long hmm_pfns[ARRAY_SIZE(args.phys)];



Is there a risk of blowing up the stack here?

...


--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -19,45 +19,45 @@
  #include 
  
  /*

- * hmm_pfn_flag_e - HMM flag enums
+ * On output:
+ * 0 - The page is faultable and a future call with
+ * HMM_PFN_REQ_FAULT could succeed.
+ * HMM_PFN_VALID - the pfn field points to a valid PFN. This PFN is at
+ * least readable. If dev_private_owner is !NULL then this could
+ * point at a DEVICE_PRIVATE page.
+ * HMM_PFN_WRITE - if the page memory can be written to (requires HMM_PFN_VALID)
+ * HMM_PFN_ERROR - accessing the pfn is impossible and the device should
+ * fail. ie poisoned memory, special pages, no vma, etc
   *
- * Flags:
- * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
- * HMM_PFN_WRITE: CPU page table has write permission set
- *
- * The driver provides a flags array for mapping page protections to device
- * PTE bits. If the driver valid bit for an entry is bit 3,
- * i.e., (entry & (1 << 3)), then the driver must provide
- * an array in hmm_range.flags with hmm_range.flags[HMM_PFN_VALID] == 1 << 3.
- * Same logic apply to all flags. This is the same idea as vm_page_prot in vma
- * except that this is per device driver rather than per architecture.
+ * On input:
+ * 0 - Return the current state of the page, do not fault it.
+ * HMM_PFN_REQ_FAULT - The output must have HMM_PFN_VALID or hmm_range_fault()
+ * will fail
+ * HMM_PFN_REQ_WRITE - The output must have HMM_PFN_WRITE or hmm_range_fault()
+ * will fail. Must be combined with HMM_PFN_REQ_FAULT.
   */
-enum hmm_pfn_flag_e {
-   HMM_PFN_VALID = 0,
-   HMM_PFN_WRITE,
-   HMM_PFN_FLAG_MAX
+enum hmm_pfn_flags {


Let's add:

/* Output flags: */


+   HMM_PFN_VALID = 1UL << (BITS_PER_LONG - 1),
+   HMM_PFN_WRITE = 1UL << (BITS_PER_LONG - 2),
+   HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
+


/* Input flags: */

...


@@ -174,44 +162,44 @@ static int hmm_vma_walk_hole(unsigned long addr, unsigned long end,
}
if (required_fault)
return hmm_vma_fault(addr, end, required_fault, walk);
-   return hmm_pfns_fill(addr, end, range, HMM_PFN_NONE);
+   return hmm_pfns_fill(addr, end, range, 0);
  }
  
-static inline uint64_t pmd_to_hmm_pfn_flags(struct hmm_range *range, pmd_t pmd)

+static inline unsigned long pmd_to_hmm_pfn_flags(struct hmm_range *range,
+pmd_t pmd)
  {
if (pmd_protnone(pmd))
return 0;
-   return pmd_write(pmd) ? range->flags[HMM_PFN_VALID] |
-   range->flags[HMM_PFN_WRITE] :
-   range->flags[HMM_PFN_VALID];
+   return pmd_write(pmd) ? (HMM_PFN_VALID | HMM_PFN_WRITE) : HMM_PFN_VALID;



I always found the previous range->flags[...] approach hard to remember, so it's
nice to see a simpler version now.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH hmm v2 4/5] mm/hmm: remove HMM_PFN_SPECIAL

2020-05-04 Thread John Hubbard

On 2020-05-01 11:20, Jason Gunthorpe wrote:

From: Jason Gunthorpe 

This is just an alias for HMM_PFN_ERROR, nothing cares that the error was
because of a special page vs any other error case.


Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA


Acked-by: Felix Kuehling 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jason Gunthorpe 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 1 -
  drivers/gpu/drm/nouveau/nouveau_svm.c   | 1 -
  include/linux/hmm.h | 8 
  mm/hmm.c| 2 +-
  4 files changed, 1 insertion(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 41ae7f96f48194..76b4a4fa39ed04 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -775,7 +775,6 @@ static const uint64_t hmm_range_flags[HMM_PFN_FLAG_MAX] = {
  static const uint64_t hmm_range_values[HMM_PFN_VALUE_MAX] = {
0xfffffffffffffffeUL, /* HMM_PFN_ERROR */
0, /* HMM_PFN_NONE */
-   0xfffffffffffffffcUL /* HMM_PFN_SPECIAL */
  };
  
  /**

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index c68e9317cf0740..cf0d9bd61bebf9 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -379,7 +379,6 @@ static const u64
  nouveau_svm_pfn_values[HMM_PFN_VALUE_MAX] = {
[HMM_PFN_ERROR  ] = ~NVIF_VMM_PFNMAP_V0_V,
[HMM_PFN_NONE   ] =  NVIF_VMM_PFNMAP_V0_NONE,
-   [HMM_PFN_SPECIAL] = ~NVIF_VMM_PFNMAP_V0_V,
  };
  
  /* Issue fault replay for GPU to retry accesses that faulted previously. */

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 0df27dd03d53d7..81c302c884c0e3 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -44,10 +44,6 @@ enum hmm_pfn_flag_e {
   * Flags:
   * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
   * HMM_PFN_NONE: corresponding CPU page table entry is pte_none()
- * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
- *  result of vmf_insert_pfn() or vm_insert_page(). Therefore, it should 
not
- *  be mirrored by a device, because the entry will never have 
HMM_PFN_VALID
- *  set and the pfn value is undefined.
   *
   * Driver provides values for none entry, error entry, and special entry.
   * Driver can alias (i.e., use same value) error and special, but
@@ -56,12 +52,10 @@ enum hmm_pfn_flag_e {
   * HMM pfn value returned by hmm_vma_get_pfns() or hmm_vma_fault() will be:
   * hmm_range.values[HMM_PFN_ERROR] if CPU page table entry is poisonous,
   * hmm_range.values[HMM_PFN_NONE] if there is no CPU page table entry,
- * hmm_range.values[HMM_PFN_SPECIAL] if CPU page table entry is a special one
   */
  enum hmm_pfn_value_e {
HMM_PFN_ERROR,
HMM_PFN_NONE,
-   HMM_PFN_SPECIAL,
HMM_PFN_VALUE_MAX
  };
  
@@ -110,8 +104,6 @@ static inline struct page *hmm_device_entry_to_page(const struct hmm_range *range)

return NULL;
if (entry == range->values[HMM_PFN_ERROR])
return NULL;
-   if (entry == range->values[HMM_PFN_SPECIAL])
-   return NULL;
if (!(entry & range->flags[HMM_PFN_VALID]))
return NULL;
return pfn_to_page(entry >> range->pfn_shift);
diff --git a/mm/hmm.c b/mm/hmm.c
index f06bcac948a79b..2e975eedb14f89 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -301,7 +301,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
pte_unmap(ptep);
return -EFAULT;
}
-   *pfn = range->values[HMM_PFN_SPECIAL];
+   *pfn = range->values[HMM_PFN_ERROR];
return 0;
}
  





Re: [PATCH hmm v2 2/5] mm/hmm: make hmm_range_fault return 0 or -1

2020-05-04 Thread John Hubbard

On 2020-05-01 11:20, Jason Gunthorpe wrote:

From: Jason Gunthorpe 

hmm_vma_walk->last is supposed to be updated after every write to the
pfns, so that it can be returned by hmm_range_fault(). However, this is
not done consistently. Fortunately nothing checks the return code of
hmm_range_fault() for anything other than error.

More importantly last must be set before returning -EBUSY as it is used to
prevent reading an output pfn as an input flags when the loop restarts.

For clarity and simplicity make hmm_range_fault() return 0 or -ERRNO. Only
set last when returning -EBUSY.
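
For illustration, a hedged caller sketch under the new contract
(hypothetical driver code, assuming the usual mmu_interval_notifier
retry loop and the mmap_sem locking of this era):

#include <linux/hmm.h>
#include <linux/mmu_notifier.h>

static int example_fault(struct hmm_range *range, struct mm_struct *mm)
{
        long ret;

        do {
                range->notifier_seq =
                        mmu_interval_read_begin(range->notifier);
                down_read(&mm->mmap_sem);
                ret = hmm_range_fault(range);
                up_read(&mm->mmap_sem);
        } while (ret == -EBUSY);

        /* Success is now 0, never a count of faulted pfns. */
        return ret;
}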


Yes, this is also a nice simplification.


...
@@ -590,10 +580,13 @@ long hmm_range_fault(struct hmm_range *range)
return -EBUSY;
ret = walk_page_range(mm, hmm_vma_walk.last, range->end,
  &hmm_walk_ops, &hmm_vma_walk);
+   /*
+* When -EBUSY is returned the loop restarts with
+* hmm_vma_walk.last set to an address that has not been stored
+* in pfns. All entries < last in the pfn array are set to their
+* output, and all >= are still at their input values.
+*/


I'm glad you added that comment. This is much easier to figure out with
that in place. After poking around this patch and eventually understanding the
.last handling, I wondered if you might like this slightly tweaked wording
instead:

/*
 * Each of the hmm_walk_ops routines returns -EBUSY if and only if
 * hmm_vma_walk.last has been set to an address that has not yet
 * been stored in pfns. All entries < last in the pfn array are
 * set to their output, and all >= are still at their input
 * values.
 */

Either way,

    Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA


} while (ret == -EBUSY);
-
-   if (ret)
-   return ret;
-   return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
+   return ret;
  }
  EXPORT_SYMBOL(hmm_range_fault);





Re: [PATCH hmm v2 1/5] mm/hmm: make CONFIG_DEVICE_PRIVATE into a select

2020-05-01 Thread John Hubbard

On 2020-05-01 11:20, Jason Gunthorpe wrote:

From: Jason Gunthorpe 

There is no reason for a user to select this or not directly - it should
be selected by drivers that are going to use the feature, similar to how
CONFIG_HMM_MIRROR works.


Yes, this is a nice touch.

Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA



Currently all drivers provide a feature kconfig that will disable use of
DEVICE_PRIVATE in that driver, allowing users to avoid enabling this if
they don't want the overhead.

Acked-by: Felix Kuehling 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jason Gunthorpe 
---
  arch/powerpc/Kconfig| 2 +-
  drivers/gpu/drm/nouveau/Kconfig | 2 +-
  mm/Kconfig  | 7 +--
  3 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 924c541a926008..8de52aefdc74cc 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -455,7 +455,7 @@ config PPC_TRANSACTIONAL_MEM
  config PPC_UV
bool "Ultravisor support"
depends on KVM_BOOK3S_HV_POSSIBLE
-   depends on DEVICE_PRIVATE
+   select DEVICE_PRIVATE
default n
help
  This option paravirtualizes the kernel to run in POWER platforms that
diff --git a/drivers/gpu/drm/nouveau/Kconfig b/drivers/gpu/drm/nouveau/Kconfig
index d6e4ae1ef7053a..af5793f3e7c2cf 100644
--- a/drivers/gpu/drm/nouveau/Kconfig
+++ b/drivers/gpu/drm/nouveau/Kconfig
@@ -86,10 +86,10 @@ config DRM_NOUVEAU_BACKLIGHT
  
  config DRM_NOUVEAU_SVM

bool "(EXPERIMENTAL) Enable SVM (Shared Virtual Memory) support"
-   depends on DEVICE_PRIVATE
depends on DRM_NOUVEAU
depends on MMU
depends on STAGING
+   select DEVICE_PRIVATE
select HMM_MIRROR
select MMU_NOTIFIER
default n
diff --git a/mm/Kconfig b/mm/Kconfig
index c1acc34c1c358c..7ca36bf5f5058e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -805,15 +805,10 @@ config HMM_MIRROR
depends on MMU
  
  config DEVICE_PRIVATE

-   bool "Unaddressable device memory (GPU memory, ...)"
+   bool
depends on ZONE_DEVICE
select DEV_PAGEMAP_OPS
  
-	help

- Allows creation of struct pages to represent unaddressable device
- memory; i.e., memory that is only accessible from the device (or
- group of devices). You likely also want to select HMM_MIRROR.
-
  config FRAME_VECTOR
bool
  





Re: [Nouveau] [PATCH] nouveau: no need to check return value of debugfs_create functions

2020-02-13 Thread John Hubbard
On 2/13/20 2:39 PM, Greg Kroah-Hartman wrote:
> On Thu, Feb 13, 2020 at 02:30:09PM -0800, John Hubbard wrote:
>> On 2/9/20 2:55 AM, Greg Kroah-Hartman wrote:
>>> When calling debugfs functions, there is no need to ever check the
>>> return value.  The function can work or not, but the code logic should
>>> never do something different based on this.
>>>
>>
>> Should we follow that line of reasoning further, and simply return void
>> from the debugfs functions--rather than playing whack-a-mole with this
>> indefinitely?
> 
> That is what we (well I) have been doing.  Look at all of the changes
> that have happened to include/linux/debugfs.h over the past few
> releases.  I'm slowly winnowing down the api to make it impossible to
> get wrong for this type of thing, and am almost there.
>


Oops, I see now that you have already been very busy with this. :)  
Looking good...


thanks,
-- 
John Hubbard
NVIDIA
 
> DRM is the big fish left to tackle, I have submitted some patches in the
> past, but lots more cleanup needs to be done to get them into mergeable
> shape.  I just need to find the time...
> thanks,
> 
> greg k-h
> 


Re: [Nouveau] [PATCH] nouveau: no need to check return value of debugfs_create functions

2020-02-13 Thread John Hubbard
On 2/9/20 2:55 AM, Greg Kroah-Hartman wrote:
> When calling debugfs functions, there is no need to ever check the
> return value.  The function can work or not, but the code logic should
> never do something different based on this.
> 

Should we follow that line of reasoning further, and simply return void
from the debugfs functions--rather than playing whack-a-mole with this
indefinitely?


thanks,
-- 
John Hubbard
NVIDIA

> Cc: Ben Skeggs 
> Cc: David Airlie 
> Cc: Daniel Vetter 
> Cc: dri-devel@lists.freedesktop.org
> Cc: nouv...@lists.freedesktop.org
> Signed-off-by: Greg Kroah-Hartman 
> ---
>  drivers/gpu/drm/nouveau/nouveau_debugfs.c | 12 
>  1 file changed, 4 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_debugfs.c b/drivers/gpu/drm/nouveau/nouveau_debugfs.c
> index 080e964d49aa..d1c82fc45a68 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_debugfs.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_debugfs.c
> @@ -224,14 +224,10 @@ nouveau_drm_debugfs_init(struct drm_minor *minor)
>   struct dentry *dentry;
>   int i;
>  
> - for (i = 0; i < ARRAY_SIZE(nouveau_debugfs_files); i++) {
> - dentry = debugfs_create_file(nouveau_debugfs_files[i].name,
> -  S_IRUGO | S_IWUSR,
> -  minor->debugfs_root, minor->dev,
> -  nouveau_debugfs_files[i].fops);
> - if (!dentry)
> - return -ENOMEM;
> - }
> + for (i = 0; i < ARRAY_SIZE(nouveau_debugfs_files); i++)
> + debugfs_create_file(nouveau_debugfs_files[i].name,
> + S_IRUGO | S_IWUSR, minor->debugfs_root,
> + minor->dev, nouveau_debugfs_files[i].fops);
>  
>   drm_debugfs_create_files(nouveau_debugfs_list,
>NOUVEAU_DEBUGFS_ENTRIES,
> 
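
For contrast with the removed error handling above, a minimal sketch of the
fire-and-forget style (illustrative only, with an assumed helper name; the
point is that nothing inspects the result):

static void sketch_debugfs_init(struct dentry *root, void *priv,
				const struct file_operations *fops)
{
	/*
	 * No return-value check: whether creation succeeds or not, the
	 * rest of the driver behaves exactly the same.
	 */
	debugfs_create_file("status", 0644, root, priv, fops);
}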


Re: [PATCH v12 04/22] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2020-01-16 Thread John Hubbard

On 1/16/20 1:37 AM, Christoph Hellwig wrote:

On Wed, Jan 15, 2020 at 01:19:41PM -0800, John Hubbard wrote:

On 1/15/20 7:23 AM, Christoph Hellwig wrote:
...


I'm really not sold on this scheme.  Not that I think it is
particularly bad, but it also doesn't seem any better than what
we had before, and it introduced quite a bit more code.



Hi Christoph,

All by itself, yes. But the very next patch (which needs a little
rework for other reasons, so not included here) needs to reuse some of
these functions within __unpin_devmap_managed_user_page():

page_is_devmap_managed()
free_devmap_managed_page()

Well, then combine it with the series that actually does the change.



OK, that makes sense. I just double-checked with a quick test run that it
doesn't have dependencies on the rest of this series, and it came out clean,
so:

Andrew, could you please remove just this one patch from mmotm and linux-next?




Also my vague recollection is that we had some idea of how to get rid
of the off-by-one refcounting for the zone device pages, which would be
a much better outcome.



Yes, I recall that Dan Williams mentioned it, but I don't think he provided
any details yet.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v12 11/22] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2020-01-15 Thread John Hubbard
On 1/15/20 7:30 AM, Christoph Hellwig wrote:
> On Tue, Jan 07, 2020 at 02:45:47PM -0800, John Hubbard wrote:
>> Introduce pin_user_pages*() variations of get_user_pages*() calls,
>> and also pin_longterm_pages*() variations.
>>
>> For now, these are placeholder calls, until the various call sites
>> are converted to use the correct get_user_pages*() or
>> pin_user_pages*() API.
> 
> What do the pure placeholders buy us?  The API itself looks ok,
> but until it actually is properly implemented it doesn't help at
> all, and we've had all kinds of bad experiences with these sorts
> of stub APIs.
> 

Hi Christoph,

Absolutely agreed, and in fact, after spending some time in this area I 
am getting a much better understanding of just how problematic "this will 
be used soon" APIs really are. However, this is not quite that case.

The following things make this different from a "pure placeholder" API:

1) These APIs are all exercised in the following patches in this series, 
unless I've overlooked one, and

2) The pages are actually tracked in the very next patch that I want to
post. That patch was posted as part of the v11 series [1], but 
Leon Romanovsky reported a problem with it, and so I'm going to add in
the ability to handle larger "pin" refcounts for the huge page cases.

Meanwhile, I wanted to get these long-simmering simpler preparatory
patches submitted, because it's clear that the API is unaffected by the
huge page refcount fix. (That fix will likely use the second struct page of
the compound page, to count up higher.)
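
As a rough sketch of what a converted call site looks like under the new
API (names are illustrative, and this uses the mmap_sem naming of this era;
it is not code from the series):

static long sketch_pin_for_dma(unsigned long uaddr, unsigned long nr,
			       struct page **pages)
{
	long pinned;

	down_read(&current->mm->mmap_sem);
	/* FOLL_PIN is implied by the API; callers never pass it directly. */
	pinned = pin_user_pages(uaddr, nr, FOLL_WRITE, pages, NULL);
	up_read(&current->mm->mmap_sem);
	return pinned;	/* number of pages pinned, or -errno */
}

static void sketch_unpin(struct page **pages, unsigned long nr)
{
	/* Must balance pin_user_pages(); plain put_page() would be a bug. */
	unpin_user_pages(pages, nr);
}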


[1] https://lore.kernel.org/r/20191216222537.491123-24-jhubb...@nvidia.com  
[PATCH v11 23/25] mm/gup: track FOLL_PIN pages

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v12 04/22] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2020-01-15 Thread John Hubbard
On 1/15/20 7:23 AM, Christoph Hellwig wrote:
...
> 
> I'm really not sold on this scheme.  Not that I think it is
> particularly bad, but it also doesn't seem any better than what
> we had before, and it introduced quite a bit more code.
> 

Hi Christoph,

All by itself, yes. But the very next patch (which needs a little 
rework for other reasons, so not included here) needs to reuse some of 
these functions within __unpin_devmap_managed_user_page():

page_is_devmap_managed()
free_devmap_managed_page()

That patch was posted as part of the v11 series [1], and it did this:

+#ifdef CONFIG_DEV_PAGEMAP_OPS
+static bool __unpin_devmap_managed_user_page(struct page *page)
+{
+   int count;
+
+   if (!page_is_devmap_managed(page))
+   return false;
+
+   count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS);
+
+   __update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, 1);
+   /*
+* devmap page refcounts are 1-based, rather than 0-based: if
+* refcount is 1, then the page is free and the refcount is
+* stable because nobody holds a reference on the page.
+*/
+   if (count == 1)
+   free_devmap_managed_page(page);
+   else if (!count)
+   __put_page(page);
+
+   return true;
+}
+#else
+static bool __unpin_devmap_managed_user_page(struct page *page)
+{
+   return false;
+}
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
+
+/**
+ * unpin_user_page() - release a dma-pinned page
+ * @page: pointer to page to be released
+ *
+ * Pages that were pinned via pin_user_pages*() must be released via either
+ * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
+ * that such pages can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special handling.
+ */
+void unpin_user_page(struct page *page)
+{
+   page = compound_head(page);
+
+   /*
+* For devmap managed pages we need to catch the refcount transition
+* from GUP_PIN_COUNTING_BIAS to 1; when the refcount reaches one it
+* means the page is free and we need to inform the device driver
+* through a callback. See include/linux/memremap.h and HMM for details.
+*/
+   if (__unpin_devmap_managed_user_page(page))
+   return;
+
+   if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS))
+   __put_page(page);
+
+   __update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, 1);
+}
+EXPORT_SYMBOL(unpin_user_page);
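
A stripped-down restatement of that 1-based rule may help (sketch naming,
not part of the patch): the release path watches for the refcount dropping
to 1 rather than 0.

static void sketch_put_devmap_page(struct page *page)
{
	/*
	 * Devmap pages idle at refcount 1, not 0: hitting 1 after a
	 * decrement means the page is now free and the driver callback
	 * must run.
	 */
	if (page_ref_dec_return(page) == 1)
		free_devmap_managed_page(page);
}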


[1] https://lore.kernel.org/r/20191216222537.491123-24-jhubb...@nvidia.com  
[PATCH v11 23/25] mm/gup: track FOLL_PIN pages

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v12 00/22] mm/gup: prereqs to track dma-pinned pages: FOLL_PIN

2020-01-14 Thread John Hubbard
On 1/9/20 2:07 PM, John Hubbard wrote:
> On 1/7/20 2:45 PM, John Hubbard wrote:
>> Hi,
>>
>> The "track FOLL_PIN pages" would have been the very next patch, but it is
>> not included here because I'm still debugging a bug report from Leon.
>> Let's get all of the prerequisite work (it's been reviewed) into the tree
>> so that future reviews are easier. It's clear that any fixes that are
>> required to the tracking patch, won't affect these patches here.
>>
>> This implements an API naming change (put_user_page*() -->
>> unpin_user_page*()), and also adds FOLL_PIN page support, up to
>> *but not including* actually tracking FOLL_PIN pages. It extends
>> the FOLL_PIN support to a few select subsystems. More subsystems will
>> be added in follow up work.
>>
> 
> Hi Andrew and all,
> 
> To clarify: I'm hoping that this series can go into 5.6.
> 
> Meanwhile, I'm working on tracking down and solving the problem that Leon
> reported, in the "track FOLL_PIN pages" patch, and that patch is not part of
> this series.
> 

Hi Andrew and all,

Any thoughts on this?

As for the not-included-yet tracking patch, my local testing still suggests the
need to allow for larger refcounts of huge pages (in other words, I can write a
test to pin huge pages many times, and overflow with the same backtrace that
Leon has reported).

The second struct page (I recall Jan suggested) can hold those, so I'm going to
proceed with that approach, while waiting to see if Leon has any more test data
for me.
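
To make the overflow concrete: FOLL_PIN tracking biases the refcount in big
steps, so the headroom disappears quickly on a much-pinned huge page. A
simplified sketch (the bias value and check mirror what later landed in
mainline, but treat this as an illustration, not the patch):

#define GUP_PIN_COUNTING_BIAS	1024	/* added to the refcount per pin */

static inline bool sketch_page_maybe_dma_pinned(struct page *page)
{
	/*
	 * Each pin adds 1024, so roughly 2 million pins overflow a
	 * 32-bit refcount; hence the plan to move the pin count into
	 * the second struct page of the compound page.
	 */
	return page_ref_count(page) >= GUP_PIN_COUNTING_BIAS;
}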

Again, I think this series is worth getting out of the way, in the meantime.


thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v12 00/22] mm/gup: prereqs to track dma-pinned pages: FOLL_PIN

2020-01-09 Thread John Hubbard
On 1/7/20 2:45 PM, John Hubbard wrote:
> Hi,
> 
> The "track FOLL_PIN pages" would have been the very next patch, but it is
> not included here because I'm still debugging a bug report from Leon.
> Let's get all of the prerequisite work (it's been reviewed) into the tree
> so that future reviews are easier. It's clear that any fixes that are
> required to the tracking patch, won't affect these patches here.
> 
> This implements an API naming change (put_user_page*() -->
> unpin_user_page*()), and also adds FOLL_PIN page support, up to
> *but not including* actually tracking FOLL_PIN pages. It extends
> the FOLL_PIN support to a few select subsystems. More subsystems will
> be added in follow up work.
> 

Hi Andrew and all,

To clarify: I'm hoping that this series can go into 5.6.

Meanwhile, I'm working on tracking down and solving the problem that Leon
reported, in the "track FOLL_PIN pages" patch, and that patch is not part of
this series.

thanks,
-- 
John Hubbard
NVIDIA


[PATCH v12 06/22] mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM

2020-01-07 Thread John Hubbard
As it says in the updated comment in gup.c: current FOLL_LONGTERM
behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the
FS DAX check requirement on vmas.

However, the corresponding restriction in get_user_pages_remote() was
slightly stricter than is actually required: it forbade all
FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
that do not set the "locked" arg.

Update the code and comments to loosen the restriction, allowing
FOLL_LONGTERM in some cases.

Also, copy the DAX check ("if a VMA is DAX, don't allow long term
pinning") from the VFIO call site, all the way into the internals
of get_user_pages_remote() and __gup_longterm_locked(). That is:
get_user_pages_remote() calls __gup_longterm_locked(), which in turn
calls check_dax_vmas(). This check will then be removed from the VFIO
call site in a subsequent patch.
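
Expressed as code, the loosened rule has roughly this shape (a sketch of
the check, not the literal hunk from the diff below):

static long sketch_remote_gup_precheck(unsigned int gup_flags, int *locked)
{
	/*
	 * FOLL_LONGTERM is only compatible with callers that do not
	 * pass a "locked" arg, because FAULT_FLAG_ALLOW_RETRY and the
	 * FS DAX vma check cannot be combined.
	 */
	if (WARN_ON_ONCE(locked && (gup_flags & FOLL_LONGTERM)))
		return -EINVAL;
	return 0;
}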

Thanks to Jason Gunthorpe for pointing out a clean way to fix this,
and to Dan Williams for helping clarify the DAX refactoring.

Tested-by: Alex Williamson 
Acked-by: Alex Williamson 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ira Weiny 
Suggested-by: Jason Gunthorpe 
Cc: Kirill A. Shutemov 
Cc: Dan Williams 
Cc: Jerome Glisse 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 174 +--
 1 file changed, 92 insertions(+), 82 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 5938e29a5a8b..b61bd5c469ae 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -,88 +,6 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
return pages_done;
 }
 
-/*
- * get_user_pages_remote() - pin user pages in memory
- * @tsk:   the task_struct to use for page fault accounting, or
- * NULL if faults are not to be recorded.
- * @mm:mm_struct of target mm
- * @start: starting user address
- * @nr_pages:  number of pages from start to pin
- * @gup_flags: flags modifying lookup behaviour
- * @pages: array that receives pointers to the pages pinned.
- * Should be at least nr_pages long. Or NULL, if caller
- * only intends to ensure the pages are faulted in.
- * @vmas:  array of pointers to vmas corresponding to each page.
- * Or NULL if the caller does not require them.
- * @locked:pointer to lock flag indicating whether lock is held and
- * subsequently whether VM_FAULT_RETRY functionality can be
- * utilised. Lock must initially be held.
- *
- * Returns either number of pages pinned (which may be less than the
- * number requested), or an error. Details about the return value:
- *
- * -- If nr_pages is 0, returns 0.
- * -- If nr_pages is >0, but no pages were pinned, returns -errno.
- * -- If nr_pages is >0, and some pages were pinned, returns the number of
- *pages pinned. Again, this may be less than nr_pages.
- *
- * The caller is responsible for releasing returned @pages, via put_page().
- *
- * @vmas are valid only as long as mmap_sem is held.
- *
- * Must be called with mmap_sem held for read or write.
- *
- * get_user_pages walks a process's page tables and takes a reference to
- * each struct page that each user address corresponds to at a given
- * instant. That is, it takes the page that would be accessed if a user
- * thread accesses the given user virtual address at that instant.
- *
- * This does not guarantee that the page exists in the user mappings when
- * get_user_pages returns, and there may even be a completely different
- * page there in some cases (eg. if mmapped pagecache has been invalidated
- * and subsequently re faulted). However it does guarantee that the page
- * won't be freed completely. And mostly callers simply care that the page
- * contains data that was valid *at some point in time*. Typically, an IO
- * or similar operation cannot guarantee anything stronger anyway because
- * locks can't be held over the syscall boundary.
- *
- * If gup_flags & FOLL_WRITE == 0, the page must not be written to. If the page
- * is written to, set_page_dirty (or set_page_dirty_lock, as appropriate) must
- * be called after the page is finished with, and before put_page is called.
- *
- * get_user_pages is typically used for fewer-copy IO operations, to get a
- * handle on the memory by some means other than accesses via the user virtual
- * addresses. The pages may be submitted for DMA to devices or accessed via
- * their kernel linear mapping (via the kmap APIs). Care should be taken to
- * use the correct cache flushing APIs.
- *
- * See also get_user_pages_fast, for performance critical applications.
- *
- * get_user_pages should be phased out in favor of
- * get_user_pages_locked|unlocked or get_user_pages_fast. Nothing
- * should use get_user_pages because it cannot pass
- * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.
- */
-long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
-   unsigned long
