Re: [RFC] memory reserve for userspace oom-killer

2021-04-20 Thread Suren Baghdasaryan
Hi Folks,

On Tue, Apr 20, 2021 at 12:18 PM Roman Gushchin  wrote:
>
> On Mon, Apr 19, 2021 at 06:44:02PM -0700, Shakeel Butt wrote:
> > Proposal: Provide memory guarantees to userspace oom-killer.
> >
> > Background:
> >
> > Issues with kernel oom-killer:
> > 1. Very conservative and prefers to reclaim. Applications can suffer
> > for a long time.
> > 2. Borrows the context of the allocator, which can be resource limited
> > (low sched priority or limited CPU quota).
> > 3. Serialized by global lock.
> > 4. Very simplistic oom victim selection policy.
> >
> > These issues are resolved through a userspace oom-killer by:
> > 1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to
> > detect suffering early.
> > 2. Independent process context which can be given dedicated CPU quota
> > and high scheduling priority.
> > 3. Can be more aggressive as required.
> > 4. Can implement sophisticated business logic/policies.
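For illustration, a minimal sketch (not part of the original proposal) of
the kind of PSI polling that item 1 above refers to, reading
/proc/pressure/memory. Real killers such as LMKD and oomd use PSI poll()
triggers rather than re-reading the file, and the threshold here is
arbitrary:

#include <stdio.h>

int main(void)
{
	char line[256];
	double avg10;
	FILE *f = fopen("/proc/pressure/memory", "r");

	if (!f)
		return 1;
	/* Lines look like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0" */
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "some avg10=%lf", &avg10) == 1 && avg10 > 10.0)
			printf("high memory pressure: some avg10=%.2f\n", avg10);
	}
	fclose(f);
	return 0;
}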
> >
> > Android's LMKD and Facebook's oomd are the prime examples of userspace
> > oom-killers. One of the biggest challenges for userspace oom-killers
> > is that they potentially have to function under intense memory pressure
> > and are prone to getting stuck in memory reclaim themselves. Current
> > userspace oom-killers aim to avoid this situation by preallocating user
> > memory and protecting themselves from global reclaim by either mlocking
> > or memory.min. However, a new allocation from the userspace oom-killer
> > can still get stuck in reclaim, and policy-rich oom-killers do trigger
> > new allocations through syscalls or even the heap.
> >
> > Our attempt at a userspace oom-killer faces similar challenges.
> > Particularly at the tail, on very highly utilized machines, we have
> > observed the userspace oom-killer failing spectacularly in many
> > possible ways in direct reclaim. We have seen the oom-killer stuck in
> > direct reclaim throttling, stuck in reclaim while allocations from
> > interrupts keep stealing reclaimed memory. We have even observed
> > systems where all the processes were stuck in throttle_direct_reclaim()
> > and only kswapd was running, with interrupts stealing the memory
> > reclaimed by kswapd.
> >
> > To reliably solve this problem, we need to give guaranteed memory to
> > the userspace oom-killer. At the moment we are deciding between the
> > following options, and I would like to get some feedback.
> >
> > 1. prctl(PF_MEMALLOC)
> >
> > The idea is to give the userspace oom-killer (just the one thread that
> > finds the appropriate victims and sends SIGKILLs) access to MEMALLOC
> > reserves. Most of the time the preallocation, mlock and memory.min
> > will be good enough, but for the rare occasions when the userspace
> > oom-killer needs to allocate, the PF_MEMALLOC flag will protect it
> > from reclaim and let the allocation dip into the memory reserves.
> >
> > Misuse of this feature would be risky, but it can be limited to
> > privileged applications. The userspace oom-killer is the only
> > appropriate user of this feature. This option is simple to implement.
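To make the proposed interface concrete, a purely hypothetical sketch of
how the killer thread might opt in; no such prctl exists upstream, and
both the PR_SET_MEMALLOC name and its request number are made up here:

#include <stdio.h>
#include <sys/prctl.h>

#define PR_SET_MEMALLOC 64	/* hypothetical request number */

int main(void)
{
	/* The victim-selection thread opts into MEMALLOC reserves. */
	if (prctl(PR_SET_MEMALLOC, 1, 0, 0, 0))
		perror("prctl(PR_SET_MEMALLOC)");
	/* ... monitor pressure, pick a victim, send SIGKILL ... */
	return 0;
}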
>
> Hello Shakeel!
>
> If ordinary PAGE_SIZE and smaller kernel allocations start to fail,
> the system is already in relatively bad shape. Arguably the userspace
> OOM killer should kick in earlier; at that point it's already a bit too late.

I tend to agree here. This is how we are trying to avoid issues with
such severe memory shortages - by tuning the killer to be a bit more
aggressive. But a more reliable mechanism would definitely be an
improvement.

> Allowing it to use reserves just pushes this even further, so we're risking
> kernel stability for no good reason.
>
> But I agree that throttling the oom daemon in direct reclaim makes no sense.
> I wonder if we can introduce a per-task flag which will exclude the task from
> throttling, but instead all (large) allocations will just fail more easily
> under significant memory pressure. In this case, if there is a significant
> memory shortage, the oom daemon will not be fully functional (it will get
> -ENOMEM for an attempt to read some stats, for example), but it will still
> be able to kill some processes and make forward progress.

This sounds like a good idea to me.

> But maybe it can be done in userspace too: by splitting the daemon into
> a core part and an extended part, and avoiding doing anything beyond the
> bare minimum in the core part.
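A rough userspace sketch of such a core/extended split: if reading stats
fails under pressure, fall back to killing a victim chosen in advance. The
stat file path and the fallback PID are illustrative only:

#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static pid_t fallback_victim = 1234;	/* example: selected beforehand */

int main(void)
{
	char buf[4096];
	int fd = open("/sys/fs/cgroup/memory.stat", O_RDONLY);

	if (fd < 0 || read(fd, buf, sizeof(buf)) < 0) {
		/* Extended logic unavailable: do the bare minimum. */
		kill(fallback_victim, SIGKILL);
	} else {
		/* ... full victim-selection policy using the stats ... */
	}
	if (fd >= 0)
		close(fd);
	return 0;
}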
>
> >
> > 2. Mempool
> >
> > The idea is to preallocate a mempool with a given amount of memory for
> > the userspace oom-killer. Preferably this would be per-thread, and the
> > oom-killer could preallocate a mempool for each of its specific threads.
> > The core page allocator can check, before going into the reclaim path,
> > whether the task has private access to a mempool, and return a page from
> > it if so.
> >
> > This option would be more complicated than the previous one, as the
> > lifecycle of a page from the mempool would be more sophisticated.
> > Additionally, the current mempool does not handle higher-order pages,
> > and we might need to 
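For reference, a kernel-context sketch of the existing mempool interface
the proposal builds on - not the proposed per-thread mechanism, just the
current API, which reserves elements of a single fixed order:

#include <linux/mempool.h>
#include <linux/gfp.h>

static mempool_t *oomd_pool;

static int oomd_pool_init(void)
{
	/* Reserve 32 order-0 pages up front. */
	oomd_pool = mempool_create_page_pool(32, 0);
	return oomd_pool ? 0 : -ENOMEM;
}

static struct page *oomd_alloc_page(void)
{
	/* Dips into the preallocated reserve when the page allocator fails. */
	return mempool_alloc(oomd_pool, GFP_KERNEL);
}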

Re: [PATCH 0/5] 4.14 backports of fixes for "CoW after fork() issue"

2021-04-07 Thread Suren Baghdasaryan
On Wed, Apr 7, 2021 at 12:23 PM Linus Torvalds
 wrote:
>
> On Wed, Apr 7, 2021 at 11:47 AM Mikulas Patocka  wrote:
> >
> > So, we fixed it, but we don't know why.
> >
> > Peter Xu's patchset that fixed it is here:
> > https://lore.kernel.org/lkml/20200821234958.7896-1-pet...@redhat.com/
>
> Yeah, that's the part that ends up being really painful to backport
> (with all the subsequent fixes too), so the 4.14 people would prefer
> to avoid it.
>
> But I think that if it's a "requires dax pmem and ptrace on top", it
> may simply be a non-issue for those users. Although who knows - maybe
> that ends up being a real issue on Android..

A lot to digest, so I need to do some reading now. Thanks everyone!

>
> Linus


Re: [PATCH 0/5] 4.14 backports of fixes for "CoW after fork() issue"

2021-04-07 Thread Suren Baghdasaryan
On Wed, Apr 7, 2021 at 9:07 AM Linus Torvalds
 wrote:
>
> On Wed, Apr 7, 2021 at 6:22 AM Vlastimil Babka  wrote:
> >
> > 1) Ignore the issue (outside of Android at least). The security model
> > of zygote is unusual. Where else does the parent of a fork() not trust
> > the child, which is the same binary?
>
> Agreed. I think this is basically an android-only issue (with
> _possibly_ some impact on crazy "pin-and-fork" loads), and doesn't
> necessarily merit a backport at all.
>
> If Android people insist on using very old kernels, knowing that they
> do things that are questionable with those old kernels, at some point
> it's just _their_ problem.

We don't really insist on using old kernels; rather, we are stuck
with them for some time.
Trying my hand at backporting the patchsets Peter mentioned proved
to be far from easy, with many dependencies. Let me look into
Vlastimil's suggestion to backport only 17839856fd58; it sounds
like 5.4 already followed that path. Thanks for all the information!
Suren.

>
>  Linus
>


Re: [PATCH 0/5] 4.14 backports of fixes for "CoW after fork() issue"

2021-04-01 Thread Suren Baghdasaryan
On Thu, Apr 1, 2021 at 4:47 PM Peter Xu  wrote:
>
> Hi, Suren,
>
> On Thu, Apr 01, 2021 at 12:43:51PM -0700, Suren Baghdasaryan wrote:
> > On Thu, Apr 1, 2021 at 11:59 AM Linus Torvalds
> >  wrote:
> > >
> > > On Thu, Apr 1, 2021 at 11:17 AM Suren Baghdasaryan  wrote:
> > > >
> > > > We received a report that the copy-on-write issue reported by Jann Horn in
> > > > https://bugs.chromium.org/p/project-zero/issues/detail?id=2045 is still
> > > > reproducible on 4.14 and 4.19 kernels (the first issue with the reproducer
> > > > coded in vmsplice.c).
> > >
> > > Gaah.
> > >
> > > > I confirmed this and also that the issue was not
> > > > reproducible with the 5.10 kernel. I tracked the fix to the following
> > > > patch introduced in 5.9, which changes the do_wp_page() logic:
> > > >
> > > > 09854ba94c6a 'mm: do_wp_page() simplification'
> > >
> > > The problem here is that there's a _lot_ more patches than the few you
> > > found that fixed various other cases (THP etc).
> > >
> > > > I backported this patch (#2 in the series) along with 2 prerequisite
> > > > patches (#1 and #4) that keep the backports clean, and two followup
> > > > fixes to the main patch (#3 and #5). I had to skip the following fix:
> > > >
> > > > feb889fb40fa 'mm: don't put pinned pages into the swap cache'
> > > >
> > > > because it uses page_maybe_dma_pinned(), which does not exist in earlier
> > > > kernels. Because pin_user_pages() does not exist there as well, I *think*
> > > > we can safely skip this fix on older kernels, but I would appreciate it
> > > > if someone could confirm that claim.
> > >
> > > Hmm. I think this means that swap activity can now break the
> > > connection to a GUP page (the whole pre-pinning model), but it
> > > probably isn't a new problem for 4.9/4.19.
> > >
> > > I suspect the test there should be something like
> > >
> > > 	/* Single mapper, more references than us and the map? */
> > > 	if (page_mapcount(page) == 1 && page_count(page) > 2)
> > > 		goto keep_locked;
> > >
> > > in the pre-pinning days.
> > >
> > > But I really think that there are a number of other commits you're
> > > missing too, because we had a whole series for THP fixes for the same
> > > exact issue.
> > >
> > > Added Peter Xu to the cc, because he probably tracked those issues
> > > better than I did.
> > >
> > > So NAK on this for now, I think this limited patch-set likely
> > > introduces more problems than it fixes.
> >
> > Thanks for confirming my worries. I'll be happy to add additional
> > backports if Peter can point me to them.
>
> For full alignment with current upstream, I can at least think of the
> following series:
>
> Early cow for general pages:
> https://lore.kernel.org/lkml/20200925222600.6832-1-pet...@redhat.com/
>
> A race fix for copy_page and gup-fast:
> https://lore.kernel.org/linux-mm/0-v4-908497cf359a+4782-gup_fork_...@nvidia.com/
>
> Early cow for hugetlbfs (which is very recent):
> https://lore.kernel.org/lkml/20210217233547.93892-1-pet...@redhat.com/
>
> But I believe they'll bring a number of dependencies too, like the page
> pinning work; so it seems not easy.

Thanks Peter. Let me try backporting these and I'll see if it's doable.

>
> Btw, AFAICT you don't need patch 4/5 in this series for 4.14/4.19, since
> those are only for uffd-wp, which didn't exist until 5.7.

Got it. Will drop it from the next series.
Thanks,
Suren.

>
> Thanks,
>
> --
> Peter Xu
>


Re: [PATCH 1/5] mm: reuse only-pte-mapped KSM page in do_wp_page()

2021-04-01 Thread Suren Baghdasaryan
On Thu, Apr 1, 2021 at 12:38 PM Greg KH  wrote:
>
> On Thu, Apr 01, 2021 at 11:17:37AM -0700, Suren Baghdasaryan wrote:
> > From: Kirill Tkhai 
> >
> > Add an optimization for KSM pages almost in the same way that we have
> > for ordinary anonymous pages.  If there is a write fault in a page
> > which is mapped by only one pte, and it is not related to swap cache,
> > the page may be reused without copying its content.
> >
> > [ Note that we do not consider PageSwapCache() pages at least for now,
> >   since we don't want to complicate __get_ksm_page(), which has a nice
> >   optimization based on this (for the migration case). Currently it is
> >   spinning on PageSwapCache() pages, waiting for when they have
> >   unfrozen counters (i.e., for the migration to finish). But we don't
> >   want to make it also spin on swap cache pages, which we try to reuse,
> >   since there is not a very high probability of reusing them. So, for
> >   now we do not consider PageSwapCache() pages at all. ]
> >
> > So in reuse_ksm_page() we check for 1) PageSwapCache() and 2)
> > page_stable_node(), to skip a page which KSM is currently trying to
> > link to the stable tree.  Then we do page_ref_freeze() to prohibit KSM
> > from merging one more page into the page we are reusing.  After that,
> > nobody can refer to the page being reused: KSM skips !PageSwapCache()
> > pages with zero refcount; and the protection against all other
> > participants is the same as for reused ordinary anon pages: pte lock,
> > page lock and mmap_sem.
> >
> > [a...@linux-foundation.org: replace BUG_ON()s with WARN_ON()s]
> > Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain
> > Signed-off-by: Kirill Tkhai 
> > Reviewed-by: Yang Shi 
> > Cc: "Kirill A. Shutemov" 
> > Cc: Hugh Dickins 
> > Cc: Andrea Arcangeli 
> > Cc: Christian Koenig 
> > Cc: Claudio Imbrenda 
> > Cc: Rik van Riel 
> > Cc: Huang Ying 
> > Cc: Minchan Kim 
> > Cc: Kirill Tkhai 
> > Signed-off-by: Andrew Morton 
> > Signed-off-by: Linus Torvalds 
> > ---
> >  include/linux/ksm.h |  7 +++++++
> >  mm/ksm.c            | 30 ++++++++++++++++++++++++++++++--
> >  mm/memory.c         | 16 ++++++++++++++--
> >  3 files changed, 49 insertions(+), 4 deletions(-)
>
> You forgot to put the git commit id of the upstream commit in here
> somewhere so we can properly reference it and track it.
>
> When/if you resend this, please add it to all of the commits.

Will do. Thanks!

>
> thanks,
>
> greg k-h


Re: [PATCH 0/5] 4.14 backports of fixes for "CoW after fork() issue"

2021-04-01 Thread Suren Baghdasaryan
On Thu, Apr 1, 2021 at 11:59 AM Linus Torvalds
 wrote:
>
> On Thu, Apr 1, 2021 at 11:17 AM Suren Baghdasaryan  wrote:
> >
> > We received a report that the copy-on-write issue reported by Jann Horn in
> > https://bugs.chromium.org/p/project-zero/issues/detail?id=2045 is still
> > reproducible on 4.14 and 4.19 kernels (the first issue with the reproducer
> > coded in vmsplice.c).
>
> Gaah.
>
> > I confirmed this and also that the issue was not
> > reproducible with the 5.10 kernel. I tracked the fix to the following patch
> > introduced in 5.9, which changes the do_wp_page() logic:
> >
> > 09854ba94c6a 'mm: do_wp_page() simplification'
>
> The problem here is that there's a _lot_ more patches than the few you
> found that fixed various other cases (THP etc).
>
> > I backported this patch (#2 in the series) along with 2 prerequisite patches
> > (#1 and #4) that keep the backports clean, and two followup fixes to the main
> > patch (#3 and #5). I had to skip the following fix:
> >
> > feb889fb40fa 'mm: don't put pinned pages into the swap cache'
> >
> > because it uses page_maybe_dma_pinned(), which does not exist in earlier
> > kernels. Because pin_user_pages() does not exist there as well, I *think*
> > we can safely skip this fix on older kernels, but I would appreciate it if
> > someone could confirm that claim.
>
> Hmm. I think this means that swap activity can now break the
> connection to a GUP page (the whole pre-pinning model), but it
> probably isn't a new problem for 4.9/4.19.
>
> I suspect the test there should be something like
>
> 	/* Single mapper, more references than us and the map? */
> 	if (page_mapcount(page) == 1 && page_count(page) > 2)
> 		goto keep_locked;
>
> in the pre-pinning days.
>
> But I really think that there are a number of other commits you're
> missing too, because we had a whole series for THP fixes for the same
> exact issue.
>
> Added Peter Xu to the cc, because he probably tracked those issues
> better than I did.
>
> So NAK on this for now, I think this limited patch-set likely
> introduces more problems than it fixes.

Thanks for confirming my worries. I'll be happy to add additional
backports if Peter can point me to them.
Thanks,
Suren.

>
> Linus


[PATCH 5/5] mm/userfaultfd: fix memory corruption due to writeprotect

2021-04-01 Thread Suren Baghdasaryan
From: Nadav Amit 

Userfaultfd self-test fails occasionally, indicating a memory corruption.

Analyzing this problem indicates that there is a real bug since mmap_lock
is only taken for read in mwriteprotect_range() and defers flushes, and
since there is insufficient consideration of concurrent deferred TLB
flushes in wp_page_copy().  Although the PTE is flushed from the TLBs in
wp_page_copy(), this flush takes place after the copy has already been
performed, and therefore changes of the page are possible between the time
of the copy and the time in which the PTE is flushed.

To make matters worse, memory-unprotection using userfaultfd also poses a
problem.  Although memory unprotection is logically a promotion of PTE
permissions, and therefore should not require a TLB flush, the current
userfaultfd code might actually cause a demotion of the architectural PTE
permission: when userfaultfd_writeprotect() unprotects a memory region, it
unintentionally *clears* the RW-bit if it was already set.  Note that
unprotecting a PTE that is not write-protected is a valid use-case: the
userfaultfd monitor might ask to unprotect a region that holds both
write-protected and write-unprotected PTEs.

The scenario that happens in selftests/vm/userfaultfd is as follows:

cpu0                            cpu1                            cpu2
----                            ----                            ----
                                                                [ Writable PTE
                                                                  cached in TLB ]
userfaultfd_writeprotect()
[ write-*unprotect* ]
mwriteprotect_range()
mmap_read_lock()
change_protection()

change_protection_range()
...
change_pte_range()
[ *clear* "write"-bit ]
[ defer TLB flushes ]
                                [ page-fault ]
                                ...
                                wp_page_copy()
                                 cow_user_page()
                                  [ copy page ]
                                                                [ write to old
                                                                  page ]
                                ...
                                 set_pte_at_notify()

A similar scenario can happen:

cpu0                    cpu1                    cpu2                    cpu3
----                    ----                    ----                    ----
                                                                        [ Writable PTE
                                                                          cached in TLB ]
userfaultfd_writeprotect()
[ write-protect ]
[ deferred TLB flush ]
                        userfaultfd_writeprotect()
                        [ write-unprotect ]
                        [ deferred TLB flush ]
                                                [ page-fault ]
                                                wp_page_copy()
                                                 cow_user_page()
                                                 [ copy page ]
                                                 ...                    [ write to page ]
                                                set_pte_at_notify()

This race has existed since commit 292924b26024 ("userfaultfd: wp: apply
_PAGE_UFFD_WP bit").  Yet, as Yu Zhao pointed out, these races became
apparent since commit 09854ba94c6a ("mm: do_wp_page() simplification"),
which made wp_page_copy() more likely to take place, specifically if
page_count(page) > 1.

To resolve the aforementioned races, check whether there are pending
flushes on uffd-write-protected VMAs, and if there are, perform a flush
before doing the COW.

Further optimizations will follow to avoid unnecessary PTE
write-protection and TLB flushes during uffd-write-unprotect.

Link: https://lkml.kernel.org/r/20210304095423.3825684-1-na...@vmware.com
Fixes: 09854ba94c6a ("mm: do_wp_page() simplification")
Signed-off-by: Nadav Amit 
Suggested-by: Yu Zhao 
Reviewed-by: Peter Xu 
Tested-by: Peter Xu 
Cc: Andrea Arcangeli 
Cc: Andy Lutomirski 
Cc: Pavel Emelyanov 
Cc: Mike Kravetz 
Cc: Mike Rapoport 
Cc: Minchan Kim 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: [5.9+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
---
 mm/memory.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 656d90a75cf8..fe6e92de9bec 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2825,6 +2825,14 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
 
+   /*
+* Userfaultfd write-protect can defer flushes. Ensure the TLB
+* is flushed in this case before copying.
+*/
+   if (unlikely(userfaultfd_wp(vmf->vma) &&
+mm_tlb_flush_pending(vmf->vma->vm_mm)))
+   flush_tlb_page(vmf->vma, vmf->address);
+
vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
if (!vmf->page) {
/*
-- 
2.31.0.291.g576ba9dcdaf-goog



[PATCH 4/5] userfaultfd: wp: add helper for writeprotect check

2021-04-01 Thread Suren Baghdasaryan
From: Shaohua Li 

Patch series "userfaultfd: write protection support", v6.

Overview


The uffd-wp work was initiated by Shaohua Li [1], and later continued by
Andrea [2].  This series is based upon Andrea's latest userfaultfd tree,
and it is a continuation of work from both Shaohua and Andrea.  Many of
the follow-up ideas come from Andrea too.

Besides the old MISSING register mode of userfaultfd, the new uffd-wp
support provides another register mode called UFFDIO_REGISTER_MODE_WP
that can be used to listen not only to missing page faults but also to
write protection page faults; the two modes can even be registered
together.  At the same time, the new feature also provides a new
userfaultfd ioctl called UFFDIO_WRITEPROTECT which allows userspace to
write protect a range of memory or fix up write permissions of faulted
pages.

Please refer to the document patch "userfaultfd: wp:
UFFDIO_REGISTER_MODE_WP documentation update" for more information on the
new interface and what it can do.

The major workflow of an uffd-wp program should be:

  1. Register a memory region with WP mode using UFFDIO_REGISTER_MODE_WP

  2. Write protect part of the whole registered region using
 UFFDIO_WRITEPROTECT, passing in UFFDIO_WRITEPROTECT_MODE_WP to
 show that we want to write protect the range.

  3. Start a working thread that modifies the protected pages,
 meanwhile listening to UFFD messages.

  4. When a write is detected in the protected range, a page fault
     happens; a UFFD message will be generated and reported to the
     page fault handling thread.

  5. The page fault handler thread resolves the page fault using the
     new UFFDIO_WRITEPROTECT ioctl, but this time passing in
     !UFFDIO_WRITEPROTECT_MODE_WP to show that we want to recover
     the write permission.  Before this operation, the fault handler
     thread can do anything it wants, e.g., dump the page to
     persistent storage.

  6. The worker thread will continue running with the correctly
 applied write permission from step 5.
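For reference, steps 1 and 2 above map onto the ioctl interface roughly as
follows - a condensed sketch with error handling trimmed, assuming a 5.7+
kernel with uffd-wp and current <linux/userfaultfd.h> headers:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

static int wp_register(void *area, unsigned long len)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)area, .len = len },
		.mode = UFFDIO_REGISTER_MODE_WP,	/* step 1 */
	};
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)area, .len = len },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,	/* step 2 */
	};

	if (uffd < 0)
		return -1;
	ioctl(uffd, UFFDIO_API, &api);
	ioctl(uffd, UFFDIO_REGISTER, &reg);
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
	return uffd;	/* read UFFD messages from this fd (steps 3-4) */
}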

Currently there are already two projects that are based on this new
userfaultfd feature.

QEMU Live Snapshot: The project provides a way to allow the QEMU
hypervisor to take snapshot of VMs without
stopping the VM [3].

LLNL umap library:  The project provides a mmap-like interface and
"allow to have an application specific buffer of
pages cached from a large file, i.e. out-of-core
execution using memory map" [4][5].

Before posting the patchset, this series was smoke tested against QEMU
live snapshot and the LLNL umap library (by doing parallel quicksort using
128 sorting threads + 80 uffd servicing threads).  My sincere thanks to
Marty McFadden and Denis Plotnikov for the help along the way.

TODO


- hugetlbfs/shmem support
- performance
- more architectures
- cooperate with mprotect()-allowed processes (???)
- ...

References
==

[1] https://lwn.net/Articles/666187/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/log/?h=userfault
[3] https://github.com/denis-plotnikov/qemu/commits/background-snapshot-kvm
[4] https://github.com/LLNL/umap
[5] https://llnl-umap.readthedocs.io/en/develop/
[6] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=b245ecf6cf59156966f3da6e6b674f6695a5ffa5
[7] https://lkml.org/lkml/2018/11/21/370
[8] https://lkml.org/lkml/2018/12/30/64

This patch (of 19):

Add helper for writeprotect check. Will use it later.

Signed-off-by: Shaohua Li 
Signed-off-by: Andrea Arcangeli 
Signed-off-by: Peter Xu 
Signed-off-by: Andrew Morton 
Reviewed-by: Jerome Glisse 
Reviewed-by: Mike Rapoport 
Cc: Rik van Riel 
Cc: Kirill A. Shutemov 
Cc: Mel Gorman 
Cc: Hugh Dickins 
Cc: Johannes Weiner 
Cc: Bobby Powers 
Cc: Brian Geffon 
Cc: David Hildenbrand 
Cc: Denis Plotnikov 
Cc: "Dr . David Alan Gilbert" 
Cc: Martin Cracauer 
Cc: Marty McFadden 
Cc: Maya Gokhale 
Cc: Mike Kravetz 
Cc: Pavel Emelyanov 
Link: http://lkml.kernel.org/r/20200220163112.11409-2-pet...@redhat.com
Signed-off-by: Linus Torvalds 
---
 include/linux/userfaultfd_k.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 37c9eba75c98..38f748e7186e 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -50,6 +50,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
return vma->vm_flags & VM_UFFD_MISSING;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+   return vma->vm_flags & VM_UFFD_WP;
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
@@ -94,6 +99,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
return false;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+   return false;
+}

[PATCH 3/5] mm: fix misplaced unlock_page in do_wp_page()

2021-04-01 Thread Suren Baghdasaryan
From: Linus Torvalds 

Commit 09854ba94c6a ("mm: do_wp_page() simplification") reorganized all
the code around the page re-use vs copy, but in the process also moved
the final unlock_page() around to after the wp_page_reuse() call.

That normally doesn't matter - but it means that the unlock_page() is
now done after releasing the page table lock.  Again, not a big deal,
you'd think.

But it turns out that it's very wrong indeed, because once we've
released the page table lock, we've basically lost our only reference to
the page - the page tables - and it could now be free'd at any time.  We
do hold the mmap_sem, so no actual unmap() can happen, but madvise can
come in and a MADV_DONTNEED will zap the page range - and free the page.

So now the page may be free'd just as we're unlocking it, which in turn
will usually trigger a "Bad page state" error in the freeing path.  To
make matters more confusing, by the time the debug code prints out the
page state, the unlock has typically completed and everything looks fine
again.

This all doesn't happen in any normal situations, but it does trigger
with the dirtyc0w_child LTP test.  And it seems to trigger much more
easily (but not exclusively) on s390 than elsewhere, probably because
s390 doesn't do the "batch pages up for freeing after the TLB flush"
that gives the unlock_page() more time to complete and makes the race
harder to hit.

Fixes: 09854ba94c6a ("mm: do_wp_page() simplification")
Link: https://lore.kernel.org/lkml/a46e9bbef2ed4e17778f5615e818526ef848d791.ca...@redhat.com/
Link: https://lore.kernel.org/linux-mm/c41149a8-211e-390b-af1d-d5eee690f...@linux.alibaba.com/
Reported-by: Qian Cai 
Reported-by: Alex Shi 
Bisected-and-analyzed-by: Gerald Schaefer 
Tested-by: Gerald Schaefer 
Signed-off-by: Linus Torvalds 
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index d95a4573a273..656d90a75cf8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2863,8 +2863,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 * page count reference, and the page is locked,
 * it's dark out, and we're wearing sunglasses. Hit it.
 */
-   wp_page_reuse(vmf);
unlock_page(page);
+   wp_page_reuse(vmf);
return VM_FAULT_WRITE;
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
-- 
2.31.0.291.g576ba9dcdaf-goog



[PATCH 1/5] mm: reuse only-pte-mapped KSM page in do_wp_page()

2021-04-01 Thread Suren Baghdasaryan
From: Kirill Tkhai 

Add an optimization for KSM pages almost in the same way that we have
for ordinary anonymous pages.  If there is a write fault in a page
which is mapped by only one pte, and it is not related to swap cache,
the page may be reused without copying its content.

[ Note that we do not consider PageSwapCache() pages at least for now,
  since we don't want to complicate __get_ksm_page(), which has a nice
  optimization based on this (for the migration case). Currently it is
  spinning on PageSwapCache() pages, waiting for when they have
  unfrozen counters (i.e., for the migration to finish). But we don't
  want to make it also spin on swap cache pages, which we try to reuse,
  since there is not a very high probability of reusing them. So, for
  now we do not consider PageSwapCache() pages at all. ]

So in reuse_ksm_page() we check for 1) PageSwapCache() and 2)
page_stable_node(), to skip a page which KSM is currently trying to
link to the stable tree.  Then we do page_ref_freeze() to prohibit KSM
from merging one more page into the page we are reusing.  After that,
nobody can refer to the page being reused: KSM skips !PageSwapCache()
pages with zero refcount; and the protection against all other
participants is the same as for reused ordinary anon pages: pte lock,
page lock and mmap_sem.

[a...@linux-foundation.org: replace BUG_ON()s with WARN_ON()s]
Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai 
Reviewed-by: Yang Shi 
Cc: "Kirill A. Shutemov" 
Cc: Hugh Dickins 
Cc: Andrea Arcangeli 
Cc: Christian Koenig 
Cc: Claudio Imbrenda 
Cc: Rik van Riel 
Cc: Huang Ying 
Cc: Minchan Kim 
Cc: Kirill Tkhai 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
---
 include/linux/ksm.h |  7 +++++++
 mm/ksm.c            | 30 ++++++++++++++++++++++++++++++--
 mm/memory.c         | 16 ++++++++++++++--
 3 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 161e8164abcf..e48b1e453ff5 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -53,6 +53,8 @@ struct page *ksm_might_need_to_copy(struct page *page,
 
 void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc);
 void ksm_migrate_page(struct page *newpage, struct page *oldpage);
+bool reuse_ksm_page(struct page *page,
+   struct vm_area_struct *vma, unsigned long address);
 
 #else  /* !CONFIG_KSM */
 
@@ -86,6 +88,11 @@ static inline void rmap_walk_ksm(struct page *page,
 static inline void ksm_migrate_page(struct page *newpage, struct page *oldpage)
 {
 }
+static inline bool reuse_ksm_page(struct page *page,
+   struct vm_area_struct *vma, unsigned long address)
+{
+   return false;
+}
 #endif /* CONFIG_MMU */
 #endif /* !CONFIG_KSM */
 
diff --git a/mm/ksm.c b/mm/ksm.c
index d021bcf94c41..c4e95ca65d62 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -705,8 +705,9 @@ static struct page *get_ksm_page(struct stable_node *stable_node, bool lock_it)
 * case this node is no longer referenced, and should be freed;
 * however, it might mean that the page is under page_ref_freeze().
 * The __remove_mapping() case is easy, again the node is now stale;
-* but if page is swapcache in migrate_page_move_mapping(), it might
-* still be our page, in which case it's essential to keep the node.
+* the same is in reuse_ksm_page() case; but if page is swapcache
+* in migrate_page_move_mapping(), it might still be our page,
+* in which case it's essential to keep the node.
 */
while (!get_page_unless_zero(page)) {
/*
@@ -2648,6 +2649,31 @@ void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc)
goto again;
 }
 
+bool reuse_ksm_page(struct page *page,
+   struct vm_area_struct *vma,
+   unsigned long address)
+{
+#ifdef CONFIG_DEBUG_VM
+   if (WARN_ON(is_zero_pfn(page_to_pfn(page))) ||
+   WARN_ON(!page_mapped(page)) ||
+   WARN_ON(!PageLocked(page))) {
+   dump_page(page, "reuse_ksm_page");
+   return false;
+   }
+#endif
+
+   if (PageSwapCache(page) || !page_stable_node(page))
+   return false;
+   /* Prohibit parallel get_ksm_page() */
+   if (!page_ref_freeze(page, 1))
+   return false;
+
+   page_move_anon_rmap(page, vma);
+   page->index = linear_page_index(vma, address);
+   page_ref_unfreeze(page, 1);
+
+   return true;
+}
 #ifdef CONFIG_MIGRATION
 void ksm_migrate_page(struct page *newpage, struct page *oldpage)
 {
diff --git a/mm/memory.c b/mm/memory.c
index c1a05c2484b0..3874acce1472 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2846,8 +2846,11 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 * Take out anonymous pages first, anonymous shared vmas are
 * not dirty accountable.
 

[PATCH 2/5] mm: do_wp_page() simplification

2021-04-01 Thread Suren Baghdasaryan
From: Linus Torvalds 

How about we just make sure we're the only possible valid user of the
page before we bother to reuse it?

Simplify, simplify, simplify.

And get rid of the nasty serialization on the page lock at the same time.

[peterx: add subject prefix]

Signed-off-by: Linus Torvalds 
Signed-off-by: Peter Xu 
Signed-off-by: Linus Torvalds 
---
 mm/memory.c | 58 +++++++++++++++++-----------------------------------------
 1 file changed, 17 insertions(+), 41 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3874acce1472..d95a4573a273 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2847,49 +2847,25 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 * not dirty accountable.
 */
if (PageAnon(vmf->page)) {
-   int total_map_swapcount;
-   if (PageKsm(vmf->page) && (PageSwapCache(vmf->page) ||
-  page_count(vmf->page) != 1))
+   struct page *page = vmf->page;
+
+   /* PageKsm() doesn't necessarily raise the page refcount */
+   if (PageKsm(page) || page_count(page) != 1)
+   goto copy;
+   if (!trylock_page(page))
+   goto copy;
+   if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
+   unlock_page(page);
goto copy;
-   if (!trylock_page(vmf->page)) {
-   get_page(vmf->page);
-   pte_unmap_unlock(vmf->pte, vmf->ptl);
-   lock_page(vmf->page);
-   vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-   vmf->address, &vmf->ptl);
-   if (!pte_same(*vmf->pte, vmf->orig_pte)) {
-   unlock_page(vmf->page);
-   pte_unmap_unlock(vmf->pte, vmf->ptl);
-   put_page(vmf->page);
-   return 0;
-   }
-   put_page(vmf->page);
-   }
-   if (PageKsm(vmf->page)) {
-   bool reused = reuse_ksm_page(vmf->page, vmf->vma,
-vmf->address);
-   unlock_page(vmf->page);
-   if (!reused)
-   goto copy;
-   wp_page_reuse(vmf);
-   return VM_FAULT_WRITE;
-   }
-   if (reuse_swap_page(vmf->page, &total_map_swapcount)) {
-   if (total_map_swapcount == 1) {
-   /*
-* The page is all ours. Move it to
-* our anon_vma so the rmap code will
-* not search our parent or siblings.
-* Protected against the rmap code by
-* the page lock.
-*/
-   page_move_anon_rmap(vmf->page, vma);
-   }
-   unlock_page(vmf->page);
-   wp_page_reuse(vmf);
-   return VM_FAULT_WRITE;
}
-   unlock_page(vmf->page);
+   /*
+* Ok, we've got the only map reference, and the only
+* page count reference, and the page is locked,
+* it's dark out, and we're wearing sunglasses. Hit it.
+*/
+   wp_page_reuse(vmf);
+   unlock_page(page);
+   return VM_FAULT_WRITE;
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
return wp_page_shared(vmf);
-- 
2.31.0.291.g576ba9dcdaf-goog



[PATCH 0/5] 4.19 backports of fixes for "CoW after fork() issue"

2021-04-01 Thread Suren Baghdasaryan
We received a report that the copy-on-write issue reported by Jann Horn in
https://bugs.chromium.org/p/project-zero/issues/detail?id=2045 is still
reproducible on 4.14 and 4.19 kernels (the first issue with the reproducer
coded in vmsplice.c). I confirmed this and also that the issue was not
reproducible with the 5.10 kernel. I tracked the fix to the following patch
introduced in 5.9, which changes the do_wp_page() logic:

09854ba94c6a 'mm: do_wp_page() simplification'

I backported this patch (#2 in the series) along with 2 prerequisite patches
(#1 and #4) that keep the backports clean, and two followup fixes to the main
patch (#3 and #5). I had to skip the following fix:

feb889fb40fa 'mm: don't put pinned pages into the swap cache'

because it uses page_maybe_dma_pinned(), which does not exist in earlier
kernels. Because pin_user_pages() does not exist there as well, I *think*
we can safely skip this fix on older kernels, but I would appreciate it if
someone could confirm that claim.

The patchset cleanly applies over: stable linux-4.19.y, tag: v4.19.184

Note: 4.14 and 4.19 backports are very similar, so while I backported
only to these two versions I think backports for other versions can be
done easily.

Kirill Tkhai (1):
  mm: reuse only-pte-mapped KSM page in do_wp_page()

Linus Torvalds (2):
  mm: do_wp_page() simplification
  mm: fix misplaced unlock_page in do_wp_page()

Nadav Amit (1):
  mm/userfaultfd: fix memory corruption due to writeprotect

Shaohua Li (1):
  userfaultfd: wp: add helper for writeprotect check

 include/linux/ksm.h           |  7 +++++++
 include/linux/userfaultfd_k.h | 10 ++++++++++
 mm/ksm.c                      | 30 ++++++++++++++++++++++++++++++--
 mm/memory.c                   | 60 ++++++++++++++++++++++++++++++--------------------------------
 4 files changed, 73 insertions(+), 34 deletions(-)

-- 
2.31.0.291.g576ba9dcdaf-goog



[PATCH 4/5] userfaultfd: wp: add helper for writeprotect check

2021-04-01 Thread Suren Baghdasaryan
From: Shaohua Li 

Patch series "userfaultfd: write protection support", v6.

Overview


The uffd-wp work was initiated by Shaohua Li [1], and later continued by
Andrea [2].  This series is based upon Andrea's latest userfaultfd tree,
and it is a continuation of work from both Shaohua and Andrea.  Many of
the follow-up ideas come from Andrea too.

Besides the old MISSING register mode of userfaultfd, the new uffd-wp
support provides another register mode called UFFDIO_REGISTER_MODE_WP
that can be used to listen not only to missing page faults but also to
write protection page faults; the two modes can even be registered
together.  At the same time, the new feature also provides a new
userfaultfd ioctl called UFFDIO_WRITEPROTECT which allows userspace to
write protect a range of memory or fix up write permissions of faulted
pages.

Please refer to the document patch "userfaultfd: wp:
UFFDIO_REGISTER_MODE_WP documentation update" for more information on the
new interface and what it can do.

The major workflow of an uffd-wp program should be:

  1. Register a memory region with WP mode using UFFDIO_REGISTER_MODE_WP

  2. Write protect part of the whole registered region using
 UFFDIO_WRITEPROTECT, passing in UFFDIO_WRITEPROTECT_MODE_WP to
 show that we want to write protect the range.

  3. Start a working thread that modifies the protected pages,
 meanwhile listening to UFFD messages.

  4. When a write is detected in the protected range, a page fault
     happens; a UFFD message will be generated and reported to the
     page fault handling thread.

  5. The page fault handler thread resolves the page fault using the
     new UFFDIO_WRITEPROTECT ioctl, but this time passing in
     !UFFDIO_WRITEPROTECT_MODE_WP to show that we want to recover
     the write permission.  Before this operation, the fault handler
     thread can do anything it wants, e.g., dump the page to
     persistent storage.

  6. The worker thread will continue running with the correctly
 applied write permission from step 5.

Currently there are already two projects that are based on this new
userfaultfd feature.

QEMU Live Snapshot: The project provides a way to allow the QEMU
hypervisor to take snapshot of VMs without
stopping the VM [3].

LLNL umap library:  The project provides a mmap-like interface and
"allow to have an application specific buffer of
pages cached from a large file, i.e. out-of-core
execution using memory map" [4][5].

Before posting the patchset, this series was smoke tested against QEMU
live snapshot and the LLNL umap library (by doing parallel quicksort using
128 sorting threads + 80 uffd servicing threads).  My sincere thanks to
Marty McFadden and Denis Plotnikov for the help along the way.

TODO


- hugetlbfs/shmem support
- performance
- more architectures
- cooperate with mprotect()-allowed processes (???)
- ...

References
==

[1] https://lwn.net/Articles/666187/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/log/?h=userfault
[3] https://github.com/denis-plotnikov/qemu/commits/background-snapshot-kvm
[4] https://github.com/LLNL/umap
[5] https://llnl-umap.readthedocs.io/en/develop/
[6] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=b245ecf6cf59156966f3da6e6b674f6695a5ffa5
[7] https://lkml.org/lkml/2018/11/21/370
[8] https://lkml.org/lkml/2018/12/30/64

This patch (of 19):

Add helper for writeprotect check. Will use it later.

Signed-off-by: Shaohua Li 
Signed-off-by: Andrea Arcangeli 
Signed-off-by: Peter Xu 
Signed-off-by: Andrew Morton 
Reviewed-by: Jerome Glisse 
Reviewed-by: Mike Rapoport 
Cc: Rik van Riel 
Cc: Kirill A. Shutemov 
Cc: Mel Gorman 
Cc: Hugh Dickins 
Cc: Johannes Weiner 
Cc: Bobby Powers 
Cc: Brian Geffon 
Cc: David Hildenbrand 
Cc: Denis Plotnikov 
Cc: "Dr . David Alan Gilbert" 
Cc: Martin Cracauer 
Cc: Marty McFadden 
Cc: Maya Gokhale 
Cc: Mike Kravetz 
Cc: Pavel Emelyanov 
Link: http://lkml.kernel.org/r/20200220163112.11409-2-pet...@redhat.com
Signed-off-by: Linus Torvalds 
---
 include/linux/userfaultfd_k.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index f2f3b68ba910..07878cd475f2 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -48,6 +48,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
return vma->vm_flags & VM_UFFD_MISSING;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+   return vma->vm_flags & VM_UFFD_WP;
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
@@ -91,6 +96,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
return false;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+   return false;
+}

[PATCH 5/5] mm/userfaultfd: fix memory corruption due to writeprotect

2021-04-01 Thread Suren Baghdasaryan
From: Nadav Amit 

Userfaultfd self-test fails occasionally, indicating a memory corruption.

Analyzing this problem indicates that there is a real bug since mmap_lock
is only taken for read in mwriteprotect_range() and defers flushes, and
since there is insufficient consideration of concurrent deferred TLB
flushes in wp_page_copy().  Although the PTE is flushed from the TLBs in
wp_page_copy(), this flush takes place after the copy has already been
performed, and therefore changes of the page are possible between the time
of the copy and the time in which the PTE is flushed.

To make matters worse, memory-unprotection using userfaultfd also poses a
problem.  Although memory unprotection is logically a promotion of PTE
permissions, and therefore should not require a TLB flush, the current
userfaultfd code might actually cause a demotion of the architectural PTE
permission: when userfaultfd_writeprotect() unprotects a memory region, it
unintentionally *clears* the RW-bit if it was already set.  Note that
unprotecting a PTE that is not write-protected is a valid use-case: the
userfaultfd monitor might ask to unprotect a region that holds both
write-protected and write-unprotected PTEs.

The scenario that happens in selftests/vm/userfaultfd is as follows:

cpu0                            cpu1                            cpu2
----                            ----                            ----
                                                                [ Writable PTE
                                                                  cached in TLB ]
userfaultfd_writeprotect()
[ write-*unprotect* ]
mwriteprotect_range()
mmap_read_lock()
change_protection()

change_protection_range()
...
change_pte_range()
[ *clear* "write"-bit ]
[ defer TLB flushes ]
                                [ page-fault ]
                                ...
                                wp_page_copy()
                                 cow_user_page()
                                  [ copy page ]
                                                                [ write to old
                                                                  page ]
                                ...
                                 set_pte_at_notify()

A similar scenario can happen:

cpu0                    cpu1                    cpu2                    cpu3
----                    ----                    ----                    ----
                                                                        [ Writable PTE
                                                                          cached in TLB ]
userfaultfd_writeprotect()
[ write-protect ]
[ deferred TLB flush ]
                        userfaultfd_writeprotect()
                        [ write-unprotect ]
                        [ deferred TLB flush ]
                                                [ page-fault ]
                                                wp_page_copy()
                                                 cow_user_page()
                                                 [ copy page ]
                                                 ...                    [ write to page ]
                                                set_pte_at_notify()

This race has existed since commit 292924b26024 ("userfaultfd: wp: apply
_PAGE_UFFD_WP bit").  Yet, as Yu Zhao pointed out, these races became
apparent since commit 09854ba94c6a ("mm: do_wp_page() simplification"),
which made wp_page_copy() more likely to take place, specifically if
page_count(page) > 1.

To resolve the aforementioned races, check whether there are pending
flushes on uffd-write-protected VMAs, and if there are, perform a flush
before doing the COW.

Further optimizations will follow to avoid unnecessary PTE
write-protection and TLB flushes during uffd-write-unprotect.

Link: https://lkml.kernel.org/r/20210304095423.3825684-1-na...@vmware.com
Fixes: 09854ba94c6a ("mm: do_wp_page() simplification")
Signed-off-by: Nadav Amit 
Suggested-by: Yu Zhao 
Reviewed-by: Peter Xu 
Tested-by: Peter Xu 
Cc: Andrea Arcangeli 
Cc: Andy Lutomirski 
Cc: Pavel Emelyanov 
Cc: Mike Kravetz 
Cc: Mike Rapoport 
Cc: Minchan Kim 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: [5.9+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
---
 mm/memory.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 14470ceaf3f2..3f33651a2a39 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2810,6 +2810,14 @@ static int do_wp_page(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
 
+   /*
+* Userfaultfd write-protect can defer flushes. Ensure the TLB
+* is flushed in this case before copying.
+*/
+   if (unlikely(userfaultfd_wp(vmf->vma) &&
+mm_tlb_flush_pending(vmf->vma->vm_mm)))
+   flush_tlb_page(vmf->vma, vmf->address);
+
vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
if (!vmf->page) {
/*
-- 
2.31.0.291.g576ba9dcdaf-goog



[PATCH 3/5] mm: fix misplaced unlock_page in do_wp_page()

2021-04-01 Thread Suren Baghdasaryan
From: Linus Torvalds 

Commit 09854ba94c6a ("mm: do_wp_page() simplification") reorganized all
the code around the page re-use vs copy, but in the process also moved
the final unlock_page() around to after the wp_page_reuse() call.

That normally doesn't matter - but it means that the unlock_page() is
now done after releasing the page table lock.  Again, not a big deal,
you'd think.

But it turns out that it's very wrong indeed, because once we've
released the page table lock, we've basically lost our only reference to
the page - the page tables - and it could now be free'd at any time.  We
do hold the mmap_sem, so no actual unmap() can happen, but madvise can
come in and a MADV_DONTNEED will zap the page range - and free the page.

So now the page may be free'd just as we're unlocking it, which in turn
will usually trigger a "Bad page state" error in the freeing path.  To
make matters more confusing, by the time the debug code prints out the
page state, the unlock has typically completed and everything looks fine
again.

This all doesn't happen in any normal situations, but it does trigger
with the dirtyc0w_child LTP test.  And it seems to trigger much more
easily (but not exclusively) on s390 than elsewhere, probably because
s390 doesn't do the "batch pages up for freeing after the TLB flush"
that gives the unlock_page() more time to complete and makes the race
harder to hit.

Fixes: 09854ba94c6a ("mm: do_wp_page() simplification")
Link: https://lore.kernel.org/lkml/a46e9bbef2ed4e17778f5615e818526ef848d791.ca...@redhat.com/
Link: https://lore.kernel.org/linux-mm/c41149a8-211e-390b-af1d-d5eee690f...@linux.alibaba.com/
Reported-by: Qian Cai 
Reported-by: Alex Shi 
Bisected-and-analyzed-by: Gerald Schaefer 
Tested-by: Gerald Schaefer 
Signed-off-by: Linus Torvalds 
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index e84648d81d6d..14470ceaf3f2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2848,8 +2848,8 @@ static int do_wp_page(struct vm_fault *vmf)
 * page count reference, and the page is locked,
 * it's dark out, and we're wearing sunglasses. Hit it.
 */
-   wp_page_reuse(vmf);
unlock_page(page);
+   wp_page_reuse(vmf);
return VM_FAULT_WRITE;
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
-- 
2.31.0.291.g576ba9dcdaf-goog



[PATCH 2/5] mm: do_wp_page() simplification

2021-04-01 Thread Suren Baghdasaryan
From: Linus Torvalds 

How about we just make sure we're the only possible valid user of the
page before we bother to reuse it?

Simplify, simplify, simplify.

And get rid of the nasty serialization on the page lock at the same time.

[peterx: add subject prefix]

Signed-off-by: Linus Torvalds 
Signed-off-by: Peter Xu 
Signed-off-by: Linus Torvalds 
---
 mm/memory.c | 58 +++++++++++++++++-----------------------------------------
 1 file changed, 17 insertions(+), 41 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 6920bfb3f89c..e84648d81d6d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2832,49 +2832,25 @@ static int do_wp_page(struct vm_fault *vmf)
 * not dirty accountable.
 */
if (PageAnon(vmf->page)) {
-   int total_map_swapcount;
-   if (PageKsm(vmf->page) && (PageSwapCache(vmf->page) ||
-  page_count(vmf->page) != 1))
+   struct page *page = vmf->page;
+
+   /* PageKsm() doesn't necessarily raise the page refcount */
+   if (PageKsm(page) || page_count(page) != 1)
+   goto copy;
+   if (!trylock_page(page))
+   goto copy;
+   if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
+   unlock_page(page);
goto copy;
-   if (!trylock_page(vmf->page)) {
-   get_page(vmf->page);
-   pte_unmap_unlock(vmf->pte, vmf->ptl);
-   lock_page(vmf->page);
-   vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-   vmf->address, &vmf->ptl);
-   if (!pte_same(*vmf->pte, vmf->orig_pte)) {
-   unlock_page(vmf->page);
-   pte_unmap_unlock(vmf->pte, vmf->ptl);
-   put_page(vmf->page);
-   return 0;
-   }
-   put_page(vmf->page);
-   }
-   if (PageKsm(vmf->page)) {
-   bool reused = reuse_ksm_page(vmf->page, vmf->vma,
-vmf->address);
-   unlock_page(vmf->page);
-   if (!reused)
-   goto copy;
-   wp_page_reuse(vmf);
-   return VM_FAULT_WRITE;
-   }
-   if (reuse_swap_page(vmf->page, &total_map_swapcount)) {
-   if (total_map_swapcount == 1) {
-   /*
-* The page is all ours. Move it to
-* our anon_vma so the rmap code will
-* not search our parent or siblings.
-* Protected against the rmap code by
-* the page lock.
-*/
-   page_move_anon_rmap(vmf->page, vma);
-   }
-   unlock_page(vmf->page);
-   wp_page_reuse(vmf);
-   return VM_FAULT_WRITE;
}
-   unlock_page(vmf->page);
+   /*
+* Ok, we've got the only map reference, and the only
+* page count reference, and the page is locked,
+* it's dark out, and we're wearing sunglasses. Hit it.
+*/
+   wp_page_reuse(vmf);
+   unlock_page(page);
+   return VM_FAULT_WRITE;
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
return wp_page_shared(vmf);
-- 
2.31.0.291.g576ba9dcdaf-goog



[PATCH 1/5] mm: reuse only-pte-mapped KSM page in do_wp_page()

2021-04-01 Thread Suren Baghdasaryan
From: Kirill Tkhai 

Add an optimization for KSM pages almost in the same way that we have
for ordinary anonymous pages.  If there is a write fault in a page
which is mapped by only one pte, and it is not related to swap cache,
the page may be reused without copying its content.

[ Note that we do not consider PageSwapCache() pages at least for now,
  since we don't want to complicate __get_ksm_page(), which has a nice
  optimization based on this (for the migration case). Currently it is
  spinning on PageSwapCache() pages, waiting for when they have
  unfrozen counters (i.e., for the migration to finish). But we don't
  want to make it also spin on swap cache pages, which we try to reuse,
  since there is not a very high probability of reusing them. So, for
  now we do not consider PageSwapCache() pages at all. ]

So in reuse_ksm_page() we check for 1) PageSwapCache() and 2)
page_stable_node(), to skip a page which KSM is currently trying to
link to the stable tree.  Then we do page_ref_freeze() to prohibit KSM
from merging one more page into the page we are reusing.  After that,
nobody can refer to the page being reused: KSM skips !PageSwapCache()
pages with zero refcount; and the protection against all other
participants is the same as for reused ordinary anon pages: pte lock,
page lock and mmap_sem.

[a...@linux-foundation.org: replace BUG_ON()s with WARN_ON()s]
Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai 
Reviewed-by: Yang Shi 
Cc: "Kirill A. Shutemov" 
Cc: Hugh Dickins 
Cc: Andrea Arcangeli 
Cc: Christian Koenig 
Cc: Claudio Imbrenda 
Cc: Rik van Riel 
Cc: Huang Ying 
Cc: Minchan Kim 
Cc: Kirill Tkhai 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
---
 include/linux/ksm.h |  7 +++++++
 mm/ksm.c            | 30 ++++++++++++++++++++++++++++++--
 mm/memory.c         | 16 ++++++++++++++--
 3 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 44368b19b27e..def48a2d87aa 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -64,6 +64,8 @@ struct page *ksm_might_need_to_copy(struct page *page,
 
 void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc);
 void ksm_migrate_page(struct page *newpage, struct page *oldpage);
+bool reuse_ksm_page(struct page *page,
+   struct vm_area_struct *vma, unsigned long address);
 
 #else  /* !CONFIG_KSM */
 
@@ -103,6 +105,11 @@ static inline void rmap_walk_ksm(struct page *page,
 static inline void ksm_migrate_page(struct page *newpage, struct page *oldpage)
 {
 }
+static inline bool reuse_ksm_page(struct page *page,
+   struct vm_area_struct *vma, unsigned long address)
+{
+   return false;
+}
 #endif /* CONFIG_MMU */
 #endif /* !CONFIG_KSM */
 
diff --git a/mm/ksm.c b/mm/ksm.c
index 65d4bf52f543..62419735ee9c 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -695,8 +695,9 @@ static struct page *get_ksm_page(struct stable_node *stable_node, bool lock_it)
 * case this node is no longer referenced, and should be freed;
 * however, it might mean that the page is under page_freeze_refs().
 * The __remove_mapping() case is easy, again the node is now stale;
-* but if page is swapcache in migrate_page_move_mapping(), it might
-* still be our page, in which case it's essential to keep the node.
+* the same is in reuse_ksm_page() case; but if page is swapcache
+* in migrate_page_move_mapping(), it might still be our page,
+* in which case it's essential to keep the node.
 */
while (!get_page_unless_zero(page)) {
/*
@@ -2609,6 +2610,31 @@ void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc)
goto again;
 }
 
+bool reuse_ksm_page(struct page *page,
+   struct vm_area_struct *vma,
+   unsigned long address)
+{
+#ifdef CONFIG_DEBUG_VM
+   if (WARN_ON(is_zero_pfn(page_to_pfn(page))) ||
+   WARN_ON(!page_mapped(page)) ||
+   WARN_ON(!PageLocked(page))) {
+   dump_page(page, "reuse_ksm_page");
+   return false;
+   }
+#endif
+
+   if (PageSwapCache(page) || !page_stable_node(page))
+   return false;
+   /* Prohibit parallel get_ksm_page() */
+   if (!page_ref_freeze(page, 1))
+   return false;
+
+   page_move_anon_rmap(page, vma);
+   page->index = linear_page_index(vma, address);
+   page_ref_unfreeze(page, 1);
+
+   return true;
+}
 #ifdef CONFIG_MIGRATION
 void ksm_migrate_page(struct page *newpage, struct page *oldpage)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 21a0bbb9c21f..6920bfb3f89c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2831,8 +2831,11 @@ static int do_wp_page(struct vm_fault *vmf)
 * Take out anonymous pages first, anonymous shared vmas are
 * not dirty accountable.
 

[PATCH 0/5] 4.14 backports of fixes for "CoW after fork() issue"

2021-04-01 Thread Suren Baghdasaryan
We received a report that the copy-on-write issue reported by Jann Horn in
https://bugs.chromium.org/p/project-zero/issues/detail?id=2045 is still
reproducible on 4.14 and 4.19 kernels (the first issue with the reproducer
coded in vmsplice.c). I confirmed this and also that the issue was not
reproducible with the 5.10 kernel. I tracked the fix to the following patch
introduced in 5.9, which changes the do_wp_page() logic:

09854ba94c6a 'mm: do_wp_page() simplification'

I backported this patch (#2 in the series) along with 2 prerequisite patches
(#1 and #4) that keep the backports clean, and two followup fixes to the main
patch (#3 and #5). I had to skip the following fix:

feb889fb40fa 'mm: don't put pinned pages into the swap cache'

because it uses page_maybe_dma_pinned(), which does not exist in earlier
kernels. Because pin_user_pages() does not exist there as well, I *think*
we can safely skip this fix on older kernels, but I would appreciate it if
someone could confirm that claim.

The patchset cleanly applies over: stable linux-4.14.y, tag: v4.14.228

Note: 4.14 and 4.19 backports are very similar, so while I backported
only to these two versions I think backports for other versions can be
done easily.

Kirill Tkhai (1):
  mm: reuse only-pte-mapped KSM page in do_wp_page()

Linus Torvalds (2):
  mm: do_wp_page() simplification
  mm: fix misplaced unlock_page in do_wp_page()

Nadav Amit (1):
  mm/userfaultfd: fix memory corruption due to writeprotect

Shaohua Li (1):
  userfaultfd: wp: add helper for writeprotect check

 include/linux/ksm.h           |  7 +++++++
 include/linux/userfaultfd_k.h | 10 ++++++++++
 mm/ksm.c                      | 30 ++++++++++++++++++++++++++++++--
 mm/memory.c                   | 60 ++++++++++++++++++++++++++++++--------------------------------
 4 files changed, 73 insertions(+), 34 deletions(-)

-- 
2.31.0.291.g576ba9dcdaf-goog



Re: [PATCH v3 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-03-05 Thread Suren Baghdasaryan
On Fri, Mar 5, 2021 at 10:23 AM David Hildenbrand  wrote:
>
> On 05.03.21 19:08, Suren Baghdasaryan wrote:
> > On Fri, Mar 5, 2021 at 9:52 AM David Hildenbrand  wrote:
> >>
> >> On 05.03.21 18:45, Shakeel Butt wrote:
> >>> On Fri, Mar 5, 2021 at 9:37 AM David Hildenbrand  wrote:
> >>>>
> >>>> On 04.03.21 01:03, Shakeel Butt wrote:
> >>>>> On Wed, Mar 3, 2021 at 3:34 PM Suren Baghdasaryan  
> >>>>> wrote:
> >>>>>>
> >>>>>> On Wed, Mar 3, 2021 at 3:17 PM Shakeel Butt  
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> On Wed, Mar 3, 2021 at 10:58 AM Suren Baghdasaryan 
> >>>>>>>  wrote:
> >>>>>>>>
> >>>>>>>> process_madvise currently requires ptrace attach capability.
> >>>>>>>> PTRACE_MODE_ATTACH gives one process complete control over another
> >>>>>>>> process. It effectively removes the security boundary between the
> >>>>>>>> two processes (in one direction). Granting ptrace attach capability
> >>>>>>>> even to a system process is considered dangerous since it creates an
> >>>>>>>> attack surface. This severely limits the usage of this API.
> >>>>>>>> The operations process_madvise can perform do not affect the 
> >>>>>>>> correctness
> >>>>>>>> of the operation of the target process; they only affect where the 
> >>>>>>>> data
> >>>>>>>> is physically located (and therefore, how fast it can be accessed).
> >>>>>>>> What we want is the ability for one process to influence another 
> >>>>>>>> process
> >>>>>>>> in order to optimize performance across the entire system while 
> >>>>>>>> leaving
> >>>>>>>> the security boundary intact.
> >>>>>>>> Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ
> >>>>>>>> and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata
> >>>>>>>> and CAP_SYS_NICE for influencing process performance.
> >>>>>>>>
> >>>>>>>> Cc: sta...@vger.kernel.org # 5.10+
> >>>>>>>> Signed-off-by: Suren Baghdasaryan 
> >>>>>>>> Reviewed-by: Kees Cook 
> >>>>>>>> Acked-by: Minchan Kim 
> >>>>>>>> Acked-by: David Rientjes 
> >>>>>>>> ---
> >>>>>>>> changes in v3
> >>>>>>>> - Added Reviewed-by: Kees Cook 
> >>>>>>>> - Created man page for process_madvise per Andrew's request: 
> >>>>>>>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=a144f458bad476a3358e3a45023789cb7bb9f993
> >>>>>>>> - cc'ed sta...@vger.kernel.org # 5.10+ per Andrew's request
> >>>>>>>> - cc'ed linux-security-mod...@vger.kernel.org per James Morris's 
> >>>>>>>> request
> >>>>>>>>
> >>>>>>>> mm/madvise.c | 13 -
> >>>>>>>> 1 file changed, 12 insertions(+), 1 deletion(-)
> >>>>>>>>
> >>>>>>>> diff --git a/mm/madvise.c b/mm/madvise.c
> >>>>>>>> index df692d2e35d4..01fef79ac761 100644
> >>>>>>>> --- a/mm/madvise.c
> >>>>>>>> +++ b/mm/madvise.c
> >>>>>>>> @@ -1198,12 +1198,22 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, 
> >>>>>>>> const struct iovec __user *, vec,
> >>>>>>>>goto release_task;
> >>>>>>>>}
> >>>>>>>>
> >>>>>>>> -   mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> >>>>>>>> +   /* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. 
> >>>>>>>> */
> >>>>>>>> +   mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
> >>>>>>>>if (IS_ERR_OR_NULL(mm)) {
> >>>>>>>>ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> >>>>>>>

Re: [PATCH v3 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-03-05 Thread Suren Baghdasaryan
On Fri, Mar 5, 2021 at 9:52 AM David Hildenbrand  wrote:
>
> On 05.03.21 18:45, Shakeel Butt wrote:
> > On Fri, Mar 5, 2021 at 9:37 AM David Hildenbrand  wrote:
> >>
> >> On 04.03.21 01:03, Shakeel Butt wrote:
> >>> On Wed, Mar 3, 2021 at 3:34 PM Suren Baghdasaryan  
> >>> wrote:
> >>>>
> >>>> On Wed, Mar 3, 2021 at 3:17 PM Shakeel Butt  wrote:
> >>>>>
> >>>>> On Wed, Mar 3, 2021 at 10:58 AM Suren Baghdasaryan  
> >>>>> wrote:
> >>>>>>
> >>>>>> process_madvise currently requires ptrace attach capability.
> >>>>>> PTRACE_MODE_ATTACH gives one process complete control over another
> >>>>>> process. It effectively removes the security boundary between the
> >>>>>> two processes (in one direction). Granting ptrace attach capability
> >>>>>> even to a system process is considered dangerous since it creates an
> >>>>>> attack surface. This severely limits the usage of this API.
> >>>>>> The operations process_madvise can perform do not affect the 
> >>>>>> correctness
> >>>>>> of the operation of the target process; they only affect where the data
> >>>>>> is physically located (and therefore, how fast it can be accessed).
> >>>>>> What we want is the ability for one process to influence another 
> >>>>>> process
> >>>>>> in order to optimize performance across the entire system while leaving
> >>>>>> the security boundary intact.
> >>>>>> Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ
> >>>>>> and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata
> >>>>>> and CAP_SYS_NICE for influencing process performance.
> >>>>>>
> >>>>>> Cc: sta...@vger.kernel.org # 5.10+
> >>>>>> Signed-off-by: Suren Baghdasaryan 
> >>>>>> Reviewed-by: Kees Cook 
> >>>>>> Acked-by: Minchan Kim 
> >>>>>> Acked-by: David Rientjes 
> >>>>>> ---
> >>>>>> changes in v3
> >>>>>> - Added Reviewed-by: Kees Cook 
> >>>>>> - Created man page for process_madvise per Andrew's request: 
> >>>>>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=a144f458bad476a3358e3a45023789cb7bb9f993
> >>>>>> - cc'ed sta...@vger.kernel.org # 5.10+ per Andrew's request
> >>>>>> - cc'ed linux-security-mod...@vger.kernel.org per James Morris's 
> >>>>>> request
> >>>>>>
> >>>>>>mm/madvise.c | 13 -
> >>>>>>1 file changed, 12 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/mm/madvise.c b/mm/madvise.c
> >>>>>> index df692d2e35d4..01fef79ac761 100644
> >>>>>> --- a/mm/madvise.c
> >>>>>> +++ b/mm/madvise.c
> >>>>>> @@ -1198,12 +1198,22 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, 
> >>>>>> const struct iovec __user *, vec,
> >>>>>>   goto release_task;
> >>>>>>   }
> >>>>>>
> >>>>>> -   mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> >>>>>> +   /* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. */
> >>>>>> +   mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
> >>>>>>   if (IS_ERR_OR_NULL(mm)) {
> >>>>>>   ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> >>>>>>   goto release_task;
> >>>>>>   }
> >>>>>>
> >>>>>> +   /*
> >>>>>> +* Require CAP_SYS_NICE for influencing process performance. 
> >>>>>> Note that
> >>>>>> +* only non-destructive hints are currently supported.
> >>>>>
> >>>>> How is non-destructive defined? Is MADV_DONTNEED non-destructive?
> >>>>
> >>>> Non-destructive in this context means the data is not lost and can be
> >>>> recovered. I follow the logic described in
> >>>> https://lwn.net/Articles/794704/ where Minchan was introducing
> >>>>>

Re: [PATCH v3 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-03-03 Thread Suren Baghdasaryan
On Wed, Mar 3, 2021 at 4:04 PM Shakeel Butt  wrote:
>
> On Wed, Mar 3, 2021 at 3:34 PM Suren Baghdasaryan  wrote:
> >
> > On Wed, Mar 3, 2021 at 3:17 PM Shakeel Butt  wrote:
> > >
> > > On Wed, Mar 3, 2021 at 10:58 AM Suren Baghdasaryan  
> > > wrote:
> > > >
> > > > process_madvise currently requires ptrace attach capability.
> > > > PTRACE_MODE_ATTACH gives one process complete control over another
> > > > process. It effectively removes the security boundary between the
> > > > two processes (in one direction). Granting ptrace attach capability
> > > > even to a system process is considered dangerous since it creates an
> > > > attack surface. This severely limits the usage of this API.
> > > > The operations process_madvise can perform do not affect the correctness
> > > > of the operation of the target process; they only affect where the data
> > > > is physically located (and therefore, how fast it can be accessed).
> > > > What we want is the ability for one process to influence another process
> > > > in order to optimize performance across the entire system while leaving
> > > > the security boundary intact.
> > > > Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ
> > > > and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata
> > > > and CAP_SYS_NICE for influencing process performance.
> > > >
> > > > Cc: sta...@vger.kernel.org # 5.10+
> > > > Signed-off-by: Suren Baghdasaryan 
> > > > Reviewed-by: Kees Cook 
> > > > Acked-by: Minchan Kim 
> > > > Acked-by: David Rientjes 
> > > > ---
> > > > changes in v3
> > > > - Added Reviewed-by: Kees Cook 
> > > > - Created man page for process_madvise per Andrew's request: 
> > > > https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=a144f458bad476a3358e3a45023789cb7bb9f993
> > > > - cc'ed sta...@vger.kernel.org # 5.10+ per Andrew's request
> > > > - cc'ed linux-security-mod...@vger.kernel.org per James Morris's request
> > > >
> > > >  mm/madvise.c | 13 -
> > > >  1 file changed, 12 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > index df692d2e35d4..01fef79ac761 100644
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -1198,12 +1198,22 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, 
> > > > const struct iovec __user *, vec,
> > > > goto release_task;
> > > > }
> > > >
> > > > -   mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> > > > +   /* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. */
> > > > +   mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
> > > > if (IS_ERR_OR_NULL(mm)) {
> > > > ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> > > > goto release_task;
> > > > }
> > > >
> > > > +   /*
> > > > +* Require CAP_SYS_NICE for influencing process performance. 
> > > > Note that
> > > > +* only non-destructive hints are currently supported.
> > >
> > > How is non-destructive defined? Is MADV_DONTNEED non-destructive?
> >
> > Non-destructive in this context means the data is not lost and can be
> > recovered. I follow the logic described in
> > https://lwn.net/Articles/794704/ where Minchan was introducing
> > MADV_COLD and MADV_PAGEOUT as non-destructive versions of MADV_FREE
> > and MADV_DONTNEED. Following that logic, MADV_FREE and MADV_DONTNEED
> > would be considered destructive hints.
> > Note that process_madvise_behavior_valid() allows only MADV_COLD and
> > MADV_PAGEOUT at the moment, which are both non-destructive.
> >
>
> There is a plan to support MADV_DONTNEED for this syscall. Do we need
> to change these access checks again with that support?

I think so. Destructive hints affect the data, so we will probably
need stricter checks for those hints.


Re: [PATCH v3 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-03-03 Thread Suren Baghdasaryan
On Wed, Mar 3, 2021 at 3:17 PM Shakeel Butt  wrote:
>
> On Wed, Mar 3, 2021 at 10:58 AM Suren Baghdasaryan  wrote:
> >
> > process_madvise currently requires ptrace attach capability.
> > PTRACE_MODE_ATTACH gives one process complete control over another
> > process. It effectively removes the security boundary between the
> > two processes (in one direction). Granting ptrace attach capability
> > even to a system process is considered dangerous since it creates an
> > attack surface. This severely limits the usage of this API.
> > The operations process_madvise can perform do not affect the correctness
> > of the operation of the target process; they only affect where the data
> > is physically located (and therefore, how fast it can be accessed).
> > What we want is the ability for one process to influence another process
> > in order to optimize performance across the entire system while leaving
> > the security boundary intact.
> > Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ
> > and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata
> > and CAP_SYS_NICE for influencing process performance.
> >
> > Cc: sta...@vger.kernel.org # 5.10+
> > Signed-off-by: Suren Baghdasaryan 
> > Reviewed-by: Kees Cook 
> > Acked-by: Minchan Kim 
> > Acked-by: David Rientjes 
> > ---
> > changes in v3
> > - Added Reviewed-by: Kees Cook 
> > - Created man page for process_madvise per Andrew's request: 
> > https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=a144f458bad476a3358e3a45023789cb7bb9f993
> > - cc'ed sta...@vger.kernel.org # 5.10+ per Andrew's request
> > - cc'ed linux-security-mod...@vger.kernel.org per James Morris's request
> >
> >  mm/madvise.c | 13 -
> >  1 file changed, 12 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index df692d2e35d4..01fef79ac761 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1198,12 +1198,22 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const 
> > struct iovec __user *, vec,
> > goto release_task;
> > }
> >
> > -   mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> > +   /* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. */
> > +   mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
> > if (IS_ERR_OR_NULL(mm)) {
> > ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> > goto release_task;
> > }
> >
> > +   /*
> > +* Require CAP_SYS_NICE for influencing process performance. Note 
> > that
> > +* only non-destructive hints are currently supported.
>
> How is non-destructive defined? Is MADV_DONTNEED non-destructive?

Non-destructive in this context means the data is not lost and can be
recovered. I follow the logic described in
https://lwn.net/Articles/794704/ where Minchan was introducing
MADV_COLD and MADV_PAGEOUT as non-destructive versions of MADV_FREE
and MADV_DONTNEED. Following that logic, MADV_FREE and MADV_DONTNEED
would be considered destructive hints.
Note that process_madvise_behavior_valid() allows only MADV_COLD and
MADV_PAGEOUT at the moment, which are both non-destructive.
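
For reference, the gate being discussed looks like this around v5.10 (a
sketch; exact placement in mm/madvise.c may differ between versions):

/* Sketch of mm/madvise.c:process_madvise_behavior_valid() circa v5.10:
 * only the non-destructive hints are allowed for remote processes. */
static bool process_madvise_behavior_valid(int behavior)
{
	switch (behavior) {
	case MADV_COLD:
	case MADV_PAGEOUT:
		return true;
	default:
		return false;
	}
}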

>
> > +*/
> > +   if (!capable(CAP_SYS_NICE)) {
> > +   ret = -EPERM;
> > +   goto release_mm;
> > +   }
> > +
> > total_len = iov_iter_count(&iter);
> >
> > while (iov_iter_count(&iter)) {
> > @@ -1218,6 +1228,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const 
> > struct iovec __user *, vec,
> > if (ret == 0)
> > ret = total_len - iov_iter_count(&iter);
> >
> > +release_mm:
> > mmput(mm);
> >  release_task:
> > put_task_struct(task);
> > --
> > 2.30.1.766.gb4fecdf3b7-goog
> >


Re: [PATCH v2 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-03-03 Thread Suren Baghdasaryan
On Tue, Mar 2, 2021 at 4:19 PM Suren Baghdasaryan  wrote:
>
> On Tue, Mar 2, 2021 at 4:17 PM Andrew Morton  
> wrote:
> >
> > On Tue, 2 Mar 2021 15:53:39 -0800 Suren Baghdasaryan  
> > wrote:
> >
> > > Hi Andrew,
> > > A friendly reminder to please include this patch into mm tree.
> > > There seem to be no more questions or objections.
> > > The man page you requested is accepted here:
> > > https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=a144f458bad476a3358e3a45023789cb7bb9f993
> > > stable is CC'ed and this patch should go into 5.10 and later kernels
> > > The patch has been:
> > > Acked-by: Minchan Kim 
> > > Acked-by: David Rientjes 
> > > Reviewed-by: Kees Cook 
> > >
> > > If you want me to resend it, please let me know.
> >
> > This patch was tough.  I think it would be best to resend please, being
> > sure to cc everyone who commented.  To give everyone another chance to
> > get their heads around it.  If necessary, please update the changelog
> > to address any confusion/questions which have arisen thus far.
>
> Sure, will do. Thanks!

Posted v3 at 
https://lore.kernel.org/linux-mm/20210303185807.2160264-1-sur...@google.com/

>
> >
> > Thanks.


[PATCH v3 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-03-03 Thread Suren Baghdasaryan
process_madvise currently requires ptrace attach capability.
PTRACE_MODE_ATTACH gives one process complete control over another
process. It effectively removes the security boundary between the
two processes (in one direction). Granting ptrace attach capability
even to a system process is considered dangerous since it creates an
attack surface. This severely limits the usage of this API.
The operations process_madvise can perform do not affect the correctness
of the operation of the target process; they only affect where the data
is physically located (and therefore, how fast it can be accessed).
What we want is the ability for one process to influence another process
in order to optimize performance across the entire system while leaving
the security boundary intact.
Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ
and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata
and CAP_SYS_NICE for influencing process performance.

Cc: sta...@vger.kernel.org # 5.10+
Signed-off-by: Suren Baghdasaryan 
Reviewed-by: Kees Cook 
Acked-by: Minchan Kim 
Acked-by: David Rientjes 
---
changes in v3
- Added Reviewed-by: Kees Cook 
- Created man page for process_madvise per Andrew's request: 
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=a144f458bad476a3358e3a45023789cb7bb9f993
- cc'ed sta...@vger.kernel.org # 5.10+ per Andrew's request
- cc'ed linux-security-mod...@vger.kernel.org per James Morris's request

 mm/madvise.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index df692d2e35d4..01fef79ac761 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1198,12 +1198,22 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const 
struct iovec __user *, vec,
goto release_task;
}
 
-   mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
+   /* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. */
+   mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
if (IS_ERR_OR_NULL(mm)) {
ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
goto release_task;
}
 
+   /*
+* Require CAP_SYS_NICE for influencing process performance. Note that
+* only non-destructive hints are currently supported.
+*/
+   if (!capable(CAP_SYS_NICE)) {
+   ret = -EPERM;
+   goto release_mm;
+   }
+
total_len = iov_iter_count(&iter);
 
while (iov_iter_count(&iter)) {
@@ -1218,6 +1228,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct 
iovec __user *, vec,
if (ret == 0)
ret = total_len - iov_iter_count(&iter);
 
+release_mm:
mmput(mm);
 release_task:
put_task_struct(task);
-- 
2.30.1.766.gb4fecdf3b7-goog



Re: [PATCH v2 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-03-03 Thread Suren Baghdasaryan
On Tue, Mar 2, 2021 at 4:17 PM Andrew Morton  wrote:
>
> On Tue, 2 Mar 2021 15:53:39 -0800 Suren Baghdasaryan  
> wrote:
>
> > Hi Andrew,
> > A friendly reminder to please include this patch into mm tree.
> > There seem to be no more questions or objections.
> > The man page you requested is accepted here:
> > https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=a144f458bad476a3358e3a45023789cb7bb9f993
> > stable is CC'ed and this patch should go into 5.10 and later kernels
> > The patch has been:
> > Acked-by: Minchan Kim 
> > Acked-by: David Rientjes 
> > Reviewed-by: Kees Cook 
> >
> > If you want me to resend it, please let me know.
>
> This patch was tough.  I think it would be best to resend please, being
> sure to cc everyone who commented.  To give everyone another chance to
> get their heads around it.  If necessary, please update the changelog
> to address any confusion/questions which have arisen thus far.

Sure, will do. Thanks!

>
> Thanks.


Re: [PATCH v2 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-03-03 Thread Suren Baghdasaryan
On Mon, Feb 1, 2021 at 9:34 PM Suren Baghdasaryan  wrote:
>
> On Thu, Jan 28, 2021 at 11:08 PM Suren Baghdasaryan  wrote:
> >
> > On Thu, Jan 28, 2021 at 11:51 AM Suren Baghdasaryan  
> > wrote:
> > >
> > > On Tue, Jan 26, 2021 at 5:52 AM 'Michal Hocko' via kernel-team
> > >  wrote:
> > > >
> > > > On Wed 20-01-21 14:17:39, Jann Horn wrote:
> > > > > On Wed, Jan 13, 2021 at 3:22 PM Michal Hocko  wrote:
> > > > > > On Tue 12-01-21 09:51:24, Suren Baghdasaryan wrote:
> > > > > > > On Tue, Jan 12, 2021 at 9:45 AM Oleg Nesterov  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On 01/12, Michal Hocko wrote:
> > > > > > > > >
> > > > > > > > > On Mon 11-01-21 09:06:22, Suren Baghdasaryan wrote:
> > > > > > > > >
> > > > > > > > > > What we want is the ability for one process to influence 
> > > > > > > > > > another process
> > > > > > > > > > in order to optimize performance across the entire system 
> > > > > > > > > > while leaving
> > > > > > > > > > the security boundary intact.
> > > > > > > > > > Replace PTRACE_MODE_ATTACH with a combination of 
> > > > > > > > > > PTRACE_MODE_READ
> > > > > > > > > > and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR 
> > > > > > > > > > metadata
> > > > > > > > > > and CAP_SYS_NICE for influencing process performance.
> > > > > > > > >
> > > > > > > > > I have to say that ptrace modes are rather obscure to me. So 
> > > > > > > > > I cannot
> > > > > > > > > really judge whether MODE_READ is sufficient. My 
> > > > > > > > > understanding has
> > > > > > > > > always been that this is requred to RO access to the address 
> > > > > > > > > space. But
> > > > > > > > > this operation clearly has a visible side effect. Do we have 
> > > > > > > > > any actual
> > > > > > > > > documentation for the existing modes?
> > > > > > > > >
> > > > > > > > > I would be really curious to hear from Jann and Oleg (now 
> > > > > > > > > Cced).
> > > > > > > >
> > > > > > > > Can't comment, sorry. I never understood these security checks 
> > > > > > > > and never tried.
> > > > > > > > IIUC only selinux/etc can treat ATTACH/READ differently and I 
> > > > > > > > have no idea what
> > > > > > > > is the difference.
> > > > >
> > > > > Yama in particular only does its checks on ATTACH and ignores READ,
> > > > > that's the difference you're probably most likely to encounter on a
> > > > > normal desktop system, since some distros turn Yama on by default.
> > > > > Basically the idea there is that running "gdb -p $pid" or "strace -p
> > > > > $pid" as a normal user will usually fail, but reading /proc/$pid/maps
> > > > > still works; so you can see things like detailed memory usage
> > > > > information and such, but you're not supposed to be able to directly
> > > > > peek into a running SSH client and inject data into the existing SSH
> > > > > connection, or steal the cryptographic keys for the current
> > > > > connection, or something like that.
> > > > >
> > > > > > > I haven't seen a written explanation on ptrace modes but when I
> > > > > > > consulted Jann his explanation was:
> > > > > > >
> > > > > > > PTRACE_MODE_READ means you can inspect metadata about processes 
> > > > > > > with
> > > > > > > the specified domain, across UID boundaries.
> > > > > > > PTRACE_MODE_ATTACH means you can fully impersonate processes with 
> > > > > > > the
> > > > > > > specified domain, across UID boundaries.
> > > > > >
> > > > > > Maybe this would be a good start to document expectations. Some more

Re: [PATCH v3 1/1] process_madvise.2: Add process_madvise man page

2021-02-18 Thread Suren Baghdasaryan
On Wed, Feb 17, 2021 at 11:55 PM Michael Kerrisk (man-pages)
 wrote:
>
> Hello Suren,
>
> >> Thanks. I added a few words to clarify this.
> > Any link where I can see the final version?
>
> Sure:
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man2/process_madvise.2
>
> Also rendered below.

Looks great. Thanks for improving it, Michael!

>
> Thanks,
>
> Michael
>
> NAME
>process_madvise - give advice about use of memory to a process
>
> SYNOPSIS
>#include <sys/uio.h>
>
>ssize_t process_madvise(int pidfd, const struct iovec *iovec,
>size_t vlen, int advice,
>unsigned int flags);
>
>Note: There is no glibc wrapper for this system call; see NOTES.
>
> DESCRIPTION
>The process_madvise() system call is used to give advice or direc‐
>tions to the kernel about the address ranges of another process or
>of  the  calling  process.  It provides the advice for the address
>ranges described by iovec and vlen.  The goal of such advice is to
>improve system or application performance.
>
>The  pidfd  argument  is a PID file descriptor (see pidfd_open(2))
>that specifies the process to which the advice is to be applied.
>
>The pointer iovec points to an array of iovec structures,  defined
>in <sys/uio.h> as:
>
>struct iovec {
>void  *iov_base;/* Starting address */
>size_t iov_len; /* Length of region */
>};
>
>The iovec structure describes address ranges beginning at iov_base
>address and with the size of iov_len bytes.
>
>The vlen specifies the number of elements in the iovec  structure.
>This value must be less than or equal to IOV_MAX (defined in <limits.h> or accessible via the call sysconf(_SC_IOV_MAX)).
>
>The advice argument is one of the following values:
>
>MADV_COLD
>   See madvise(2).
>
>MADV_PAGEOUT
>   See madvise(2).
>
>The flags argument is reserved for future use; currently, this ar‐
>gument must be specified as 0.
>
>The  vlen  and iovec arguments are checked before applying any ad‐
>vice.  If vlen is too big, or iovec is invalid, then an error will
>be returned immediately and no advice will be applied.
>
>The  advice might be applied to only a part of iovec if one of its
>elements points to an invalid memory region in the remote process.
>No further elements will be processed beyond that point.  (See the
>discussion regarding partial advice in RETURN VALUE.)
>
>Permission to apply advice to another process  is  governed  by  a
>ptrace   access   mode   PTRACE_MODE_READ_REALCREDS   check   (see
>ptrace(2)); in addition, because of the  performance  implications
>of applying the advice, the caller must have the CAP_SYS_ADMIN ca‐
>pability.
>
> RETURN VALUE
>On success, process_madvise() returns the number of bytes advised.
>This  return  value may be less than the total number of requested
>bytes, if an error occurred after some iovec elements were already
>processed.   The caller should check the return value to determine
>whether a partial advice occurred.
>
>On error, -1 is returned and errno is set to indicate the error.
>
> ERRORS
>EBADF  pidfd is not a valid PID file descriptor.
>
>EFAULT The memory described by iovec is outside the accessible ad‐
>   dress space of the process referred to by pidfd.
>
>EINVAL flags is not 0.
>
>EINVAL The  sum of the iov_len values of iovec overflows a ssize_t
>   value.
>
>EINVAL vlen is too large.
>
>ENOMEM Could not allocate memory for internal copies of the  iovec
>   structures.
>
>EPERM  The  caller  does not have permission to access the address
>   space of the process pidfd.
>
>ESRCH  The target process does not exist (i.e., it has  terminated
>   and been waited on).
>
> VERSIONS
>This  system  call first appeared in Linux 5.10.  Support for this
>system call is optional, depending on  the  setting  of  the  CON‐
>FIG_ADVISE_SYSCALLS configuration option.
>
> CONFORMING TO
>The process_madvise() system call is Linux-specific.
>
> NOTES
>Glibc does not provide a wrapper for this system call; call it us‐
>ing syscall(2).
>
> SEE ALSO
>madvise(2),  pidfd_open(2),   process_vm_readv(2),
>process_vm_writev(2)
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
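
Since there is no glibc wrapper, a minimal caller looks roughly like the
sketch below. It assumes an x86-64 Linux 5.10+ system (syscall numbers
434 for pidfd_open and 440 for process_madvise) and takes the target
range on the command line; no argument validation is done.

#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434		/* x86-64, Linux 5.3+ */
#endif
#ifndef __NR_process_madvise
#define __NR_process_madvise 440	/* x86-64, Linux 5.10+ */
#endif
#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21			/* Linux 5.4+ */
#endif

/* Usage: ./a.out <pid> <addr> <len> - addr/len describe a range in the
 * *target* process, e.g. taken from /proc/<pid>/maps. Sketch only. */
int main(int argc, char *argv[])
{
	pid_t pid = (pid_t)atoi(argv[1]);
	struct iovec iov = {
		.iov_base = (void *)strtoul(argv[2], NULL, 0),
		.iov_len = strtoul(argv[3], NULL, 0),
	};

	int pidfd = (int)syscall(__NR_pidfd_open, pid, 0);
	if (pidfd < 0) {
		perror("pidfd_open");
		return 1;
	}

	ssize_t advised = syscall(__NR_process_madvise, pidfd, &iov, 1,
				  MADV_PAGEOUT, 0);
	if (advised < 0) {
		perror("process_madvise");
		return 1;
	}
	printf("advised %zd bytes\n", advised);
	return 0;
}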


Re: [PATCH v3 1/1] process_madvise.2: Add process_madvise man page

2021-02-16 Thread Suren Baghdasaryan
Hi Michael,

On Sat, Feb 13, 2021 at 2:04 PM Michael Kerrisk (man-pages)
 wrote:
>
> Hello Suren,
>
> On 2/2/21 11:12 PM, Suren Baghdasaryan wrote:
> > Hi Michael,
> >
> > On Tue, Feb 2, 2021 at 2:45 AM Michael Kerrisk (man-pages)
> >  wrote:
> >>
> >> Hello Suren (and Minchan and Michal)
> >>
> >> Thank you for the revisions!
> >>
> >> I've applied this patch, and done a few light edits.
> >
> > Thanks!
> >
> >>
> >> However, I have a questions about undocumented pieces in *madvise(2)*,
> >> as well as one other question. See below.
> >>
> >> On 2/2/21 6:30 AM, Suren Baghdasaryan wrote:
> >>> Initial version of process_madvise(2) manual page. Initial text was
> >>> extracted from [1], amended after fix [2] and more details added using
> >>> man pages of madvise(2) and process_vm_read(2) as examples. It also
> >>> includes the changes to required permission proposed in [3].
> >>>
> >>> [1] https://lore.kernel.org/patchwork/patch/1297933/
> >>> [2] https://lkml.org/lkml/2020/12/8/1282
> >>> [3] 
> >>> https://patchwork.kernel.org/project/selinux/patch/2021070622.2613577-1-sur...@google.com/#23888311
> >>>
> >>> Signed-off-by: Suren Baghdasaryan 
> >>> Reviewed-by: Michal Hocko 
> >>> ---
> >>> changes in v2:
> >>> - Changed description of MADV_COLD per Michal Hocko's suggestion
> >>> - Applied fixes suggested by Michael Kerrisk
> >>> changes in v3:
> >>> - Added Michal's Reviewed-by
> >>> - Applied additional fixes suggested by Michael Kerrisk
> >>>
> >>> NAME
> >>> process_madvise - give advice about use of memory to a process
> >>>
> >>> SYNOPSIS
> >>> #include <sys/uio.h>
> >>>
> >>> ssize_t process_madvise(int pidfd,
> >>>const struct iovec *iovec,
> >>>unsigned long vlen,
> >>>int advice,
> >>>unsigned int flags);
> >>>
> >>> DESCRIPTION
> >>> The process_madvise() system call is used to give advice or directions
> >>> to the kernel about the address ranges of another process or the 
> >>> calling
> >>> process. It provides the advice to the address ranges described by 
> >>> iovec
> >>> and vlen. The goal of such advice is to improve system or application
> >>> performance.
> >>>
> >>> The pidfd argument is a PID file descriptor (see pidfd_open(2)) that
> >>> specifies the process to which the advice is to be applied.
> >>>
> >>> The pointer iovec points to an array of iovec structures, defined in
> >>> <sys/uio.h> as:
> >>>
> >>> struct iovec {
> >>> void  *iov_base;/* Starting address */
> >>> size_t iov_len; /* Number of bytes to transfer */
> >>> };
> >>>
> >>> The iovec structure describes address ranges beginning at iov_base 
> >>> address
> >>> and with the size of iov_len bytes.
> >>>
> >>> The vlen represents the number of elements in the iovec structure.
> >>>
> >>> The advice argument is one of the values listed below.
> >>>
> >>>   Linux-specific advice values
> >>> The following Linux-specific advice values have no counterparts in the
> >>> POSIX-specified posix_madvise(3), and may or may not have counterparts
> >>> in the madvise(2) interface available on other implementations.
> >>>
> >>> MADV_COLD (since Linux 5.4.1)
> >>
> >> I just noticed these version numbers now, and thought: they can't be
> >> right (because the system call appeared only in v5.11). So I removed
> >> them. But, of course in another sense the version numbers are (nearly)
> >> right, since these advice values were added for madvise(2) in Linux 5.4.
> >> However, they are not documented in the madvise(2) manual page. Is it
> >> correct to assume that MADV_COLD and MADV_PAGEOUT have exactly the same
> >> meaning in madvise(2) (but just for the calling process, of course)?
> >
> > Correct. They should be added in the madvise(2) man page as well IMHO.
>
> So, I decided to move the description of MADV_C

Re: [RFC][PATCH v6 1/7] drm: Add a sharable drm page-pool implementation

2021-02-10 Thread Suren Baghdasaryan
On Wed, Feb 10, 2021 at 10:32 AM Christian König
 wrote:
>
>
>
> Am 10.02.21 um 17:39 schrieb Suren Baghdasaryan:
> > On Wed, Feb 10, 2021 at 5:06 AM Daniel Vetter  wrote:
> >> On Tue, Feb 09, 2021 at 12:16:51PM -0800, Suren Baghdasaryan wrote:
> >>> On Tue, Feb 9, 2021 at 12:03 PM Daniel Vetter  wrote:
> >>>> On Tue, Feb 9, 2021 at 6:46 PM Christian König 
> >>>>  wrote:
> >>>>>
> >>>>>
> >>>>> Am 09.02.21 um 18:33 schrieb Suren Baghdasaryan:
> >>>>>> On Tue, Feb 9, 2021 at 4:57 AM Christian König 
> >>>>>>  wrote:
> >>>>>>> Am 09.02.21 um 13:11 schrieb Christian König:
> >>>>>>>> [SNIP]
> >>>>>>>>>>> +void drm_page_pool_add(struct drm_page_pool *pool, struct page 
> >>>>>>>>>>> *page)
> >>>>>>>>>>> +{
> >>>>>>>>>>> + spin_lock(&pool->lock);
> >>>>>>>>>>> + list_add_tail(&page->lru, &pool->items);
> >>>>>>>>>>> + pool->count++;
> >>>>>>>>>>> + atomic_long_add(1 << pool->order, &total_pages);
> >>>>>>>>>>> + spin_unlock(&pool->lock);
> >>>>>>>>>>> +
> >>>>>>>>>>> + mod_node_page_state(page_pgdat(page),
> >>>>>>>>>>> NR_KERNEL_MISC_RECLAIMABLE,
> >>>>>>>>>>> + 1 << pool->order);
> >>>>>>>>>> Hui what? What should that be good for?
> >>>>>>>>> This is a carryover from the ION page pool implementation:
> >>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Ftree%2Fdrivers%2Fstaging%2Fandroid%2Fion%2Fion_page_pool.c%3Fh%3Dv5.10%23n28data=04%7C01%7Cchristian.koenig%40amd.com%7Cbb7155447ee149a49f3a08d8cde2685d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637485719618339413%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=IYsJoAd7SUo12V7tS3CCRqNVm569iy%2FtoXQqm2MdC1g%3Dreserved=0
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> My sense is it helps with the vmstat/meminfo accounting so folks can
> >>>>>>>>> see the cached pages are shrinkable/freeable. This maybe falls under
> >>>>>>>>> other dmabuf accounting/stats discussions, so I'm happy to remove it
> >>>>>>>>> for now, or let the drivers using the shared page pool logic handle
> >>>>>>>>> the accounting themselves?
> >>>>>>> Intentionally separated the discussion for that here.
> >>>>>>>
> >>>>>>> As far as I can see this is just bluntly incorrect.
> >>>>>>>
> >>>>>>> Either the page is reclaimable or it is part of our pool and freeable
> >>>>>>> through the shrinker, but never ever both.
> >>>>>> IIRC the original motivation for counting ION pooled pages as
> >>>>>> reclaimable was to include them into /proc/meminfo's MemAvailable
> >>>>>> calculations. NR_KERNEL_MISC_RECLAIMABLE defined as "reclaimable
> >>>>>> non-slab kernel pages" seems like a good place to account for them but
> >>>>>> I might be wrong.
> >>>>> Yeah, that's what I see here as well. But exactly that is utterly 
> >>>>> nonsense.
> >>>>>
> >>>>> Those pages are not "free" in the sense that get_free_page could return
> >>>>> them directly.
> >>>> Well on Android that is kinda true, because Android has its
> >>>> oom-killer (way back it was just a shrinker callback, not sure how it
> >>>> works now), which just shot down all the background apps. So at least
> >>>> some of that (everything used by background apps) is indeed
> >>>> reclaimable on Android.
> >>>>
> >>>> But that doesn't hold on Linux in general, so we can't really do this
> >>>> for common code.
> >>>>
> >>>> Also I had a long meeting with Suren, John and other googles
> >>>> yesterday, 

Re: [RFC][PATCH v6 1/7] drm: Add a sharable drm page-pool implementation

2021-02-10 Thread Suren Baghdasaryan
On Wed, Feb 10, 2021 at 9:21 AM Daniel Vetter  wrote:
>
> On Wed, Feb 10, 2021 at 5:39 PM Suren Baghdasaryan  wrote:
> >
> > On Wed, Feb 10, 2021 at 5:06 AM Daniel Vetter  wrote:
> > >
> > > On Tue, Feb 09, 2021 at 12:16:51PM -0800, Suren Baghdasaryan wrote:
> > > > On Tue, Feb 9, 2021 at 12:03 PM Daniel Vetter  wrote:
> > > > >
> > > > > On Tue, Feb 9, 2021 at 6:46 PM Christian König 
> > > > >  wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > Am 09.02.21 um 18:33 schrieb Suren Baghdasaryan:
> > > > > > > On Tue, Feb 9, 2021 at 4:57 AM Christian König 
> > > > > > >  wrote:
> > > > > > >> Am 09.02.21 um 13:11 schrieb Christian König:
> > > > > > >>> [SNIP]
> > > > > > >>>>>> +void drm_page_pool_add(struct drm_page_pool *pool, struct 
> > > > > > >>>>>> page *page)
> > > > > > >>>>>> +{
> > > > > > >>>>>> + spin_lock(&pool->lock);
> > > > > > >>>>>> + list_add_tail(&page->lru, &pool->items);
> > > > > > >>>>>> + pool->count++;
> > > > > > >>>>>> + atomic_long_add(1 << pool->order, &total_pages);
> > > > > > >>>>>> + spin_unlock(&pool->lock);
> > > > > > >>>>>> +
> > > > > > >>>>>> + mod_node_page_state(page_pgdat(page),
> > > > > > >>>>>> NR_KERNEL_MISC_RECLAIMABLE,
> > > > > > >>>>>> + 1 << pool->order);
> > > > > > >>>>> Hui what? What should that be good for?
> > > > > > >>>> This is a carryover from the ION page pool implementation:
> > > > > > >>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Ftree%2Fdrivers%2Fstaging%2Fandroid%2Fion%2Fion_page_pool.c%3Fh%3Dv5.10%23n28data=04%7C01%7Cchristian.koenig%40amd.com%7Cdff8edcd4d147a5b08d8cd20cff2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637484888114923580%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=9%2BIBC0tezSV6Ci4S3kWfW%2BQvJm4mdunn3dF6C0kyfCw%3Dreserved=0
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> My sense is it helps with the vmstat/meminfo accounting so 
> > > > > > >>>> folks can
> > > > > > >>>> see the cached pages are shrinkable/freeable. This maybe falls 
> > > > > > >>>> under
> > > > > > >>>> other dmabuf accounting/stats discussions, so I'm happy to 
> > > > > > >>>> remove it
> > > > > > >>>> for now, or let the drivers using the shared page pool logic 
> > > > > > >>>> handle
> > > > > > >>>> the accounting themselves?
> > > > > > >> Intentionally separated the discussion for that here.
> > > > > > >>
> > > > > > >> As far as I can see this is just bluntly incorrect.
> > > > > > >>
> > > > > > >> Either the page is reclaimable or it is part of our pool and 
> > > > > > >> freeable
> > > > > > >> through the shrinker, but never ever both.
> > > > > > > IIRC the original motivation for counting ION pooled pages as
> > > > > > > reclaimable was to include them into /proc/meminfo's MemAvailable
> > > > > > > calculations. NR_KERNEL_MISC_RECLAIMABLE defined as "reclaimable
> > > > > > > non-slab kernel pages" seems like a good place to account for 
> > > > > > > them but
> > > > > > > I might be wrong.
> > > > > >
> > > > > > Yeah, that's what I see here as well. But exactly that is utterly 
> > > > > > nonsense.
> > > > > >
> > > > > > Those pages are not "free" in the sense that get_free_page could 
> > > > > > return
> > > > > > them directly.
> > > > >
> > > > > W

Re: [PATCH] dma-buf: system_heap: do not warn for costly allocation

2021-02-10 Thread Suren Baghdasaryan
The code looks fine to me. The description needs a bit of polishing :)

On Wed, Feb 10, 2021 at 8:26 AM Minchan Kim  wrote:
>
> Linux VM is not hard to support PAGE_ALLOC_COSTLY_ODER allocation
> so normally expects driver passes __GFP_NOWARN in that case
> if they has fallback options.
>
> system_heap in dmabuf is the case so do not flood into demsg
> with the warning for recording more precious information logs.
> (below is ION warning example I got but dmabuf system heap is
> nothing different).

Suggestion:
Dmabuf system_heap allocation logic starts with the highest necessary
allocation order before falling back to lower orders. The requested
order can be higher than PAGE_ALLOC_COSTLY_ORDER, and failures to
allocate will flood dmesg with warnings. Such high-order allocations
are not unexpected and are handled by the system_heap's allocation
fallback mechanism.
Prevent these warnings when allocating higher than
PAGE_ALLOC_COSTLY_ORDER pages by using the __GFP_NOWARN flag.

Below is an ION warning example I got, but the dmabuf system heap is no different:

>
> [ 1233.911533][  T460] warn_alloc: 11 callbacks suppressed
> [ 1233.911539][  T460] allocator@2.0-s: page allocation failure: order:4, 
> mode:0x140dc2(GFP_HIGHUSER|__GFP_COMP|__GFP_ZERO), 
> nodemask=(null),cpuset=/,mems_allowed=0
> [ 1233.926235][  T460] Call trace:
> [ 1233.929370][  T460]  dump_backtrace+0x0/0x1d8
> [ 1233.933704][  T460]  show_stack+0x18/0x24
> [ 1233.937701][  T460]  dump_stack+0xc0/0x140
> [ 1233.941783][  T460]  warn_alloc+0xf4/0x148
> [ 1233.945862][  T460]  __alloc_pages_slowpath+0x9fc/0xa10
> [ 1233.951101][  T460]  __alloc_pages_nodemask+0x278/0x2c0
> [ 1233.956285][  T460]  ion_page_pool_alloc+0xd8/0x100
> [ 1233.961144][  T460]  ion_system_heap_allocate+0xbc/0x2f0
> [ 1233.966440][  T460]  ion_buffer_create+0x68/0x274
> [ 1233.971130][  T460]  ion_buffer_alloc+0x8c/0x110
> [ 1233.975733][  T460]  ion_dmabuf_alloc+0x44/0xe8
> [ 1233.980248][  T460]  ion_ioctl+0x100/0x320
> [ 1233.984332][  T460]  __arm64_sys_ioctl+0x90/0xc8
> [ 1233.988934][  T460]  el0_svc_common+0x9c/0x168
> [ 1233.993360][  T460]  do_el0_svc+0x1c/0x28
> [ 1233.997358][  T460]  el0_sync_handler+0xd8/0x250
> [ 1234.001989][  T460]  el0_sync+0x148/0x180
>
> Signed-off-by: Minchan Kim 
> ---
>  drivers/dma-buf/heaps/system_heap.c | 9 +++--
>  1 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/dma-buf/heaps/system_heap.c 
> b/drivers/dma-buf/heaps/system_heap.c
> index 29e49ac17251..33c25a5e06f9 100644
> --- a/drivers/dma-buf/heaps/system_heap.c
> +++ b/drivers/dma-buf/heaps/system_heap.c
> @@ -40,7 +40,7 @@ struct dma_heap_attachment {
> bool mapped;
>  };
>
> -#define HIGH_ORDER_GFP  (((GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN \
> +#define HIGH_ORDER_GFP  (((GFP_HIGHUSER | __GFP_ZERO \
> | __GFP_NORETRY) & ~__GFP_RECLAIM) \
> | __GFP_COMP)
>  #define LOW_ORDER_GFP (GFP_HIGHUSER | __GFP_ZERO | __GFP_COMP)
> @@ -315,6 +315,7 @@ static struct page *alloc_largest_available(unsigned long 
> size,
> unsigned int max_order)
>  {
> struct page *page;
> +   unsigned long gfp_flags;
> int i;
>
> for (i = 0; i < NUM_ORDERS; i++) {
> @@ -323,7 +324,11 @@ static struct page *alloc_largest_available(unsigned 
> long size,
> if (max_order < orders[i])
> continue;
>
> -   page = alloc_pages(order_flags[i], orders[i]);
> +   gfp_flags = order_flags[i];
> +   if (orders[i] > PAGE_ALLOC_COSTLY_ORDER)
> +   gfp_flags |= __GFP_NOWARN;
> +
> +   page = alloc_pages(gfp_flags, orders[i]);
> if (!page)
> continue;
> return page;
> --
> 2.30.0.478.g8a0d178c01-goog
>


Re: [RFC][PATCH v6 1/7] drm: Add a sharable drm page-pool implementation

2021-02-10 Thread Suren Baghdasaryan
On Wed, Feb 10, 2021 at 5:06 AM Daniel Vetter  wrote:
>
> On Tue, Feb 09, 2021 at 12:16:51PM -0800, Suren Baghdasaryan wrote:
> > On Tue, Feb 9, 2021 at 12:03 PM Daniel Vetter  wrote:
> > >
> > > On Tue, Feb 9, 2021 at 6:46 PM Christian König  
> > > wrote:
> > > >
> > > >
> > > >
> > > > Am 09.02.21 um 18:33 schrieb Suren Baghdasaryan:
> > > > > On Tue, Feb 9, 2021 at 4:57 AM Christian König 
> > > > >  wrote:
> > > > >> Am 09.02.21 um 13:11 schrieb Christian König:
> > > > >>> [SNIP]
> > > > >>>>>> +void drm_page_pool_add(struct drm_page_pool *pool, struct page 
> > > > >>>>>> *page)
> > > > >>>>>> +{
> > > > >>>>>> + spin_lock(&pool->lock);
> > > > >>>>>> + list_add_tail(&page->lru, &pool->items);
> > > > >>>>>> + pool->count++;
> > > > >>>>>> + atomic_long_add(1 << pool->order, &total_pages);
> > > > >>>>>> + spin_unlock(&pool->lock);
> > > > >>>>>> +
> > > > >>>>>> + mod_node_page_state(page_pgdat(page),
> > > > >>>>>> NR_KERNEL_MISC_RECLAIMABLE,
> > > > >>>>>> + 1 << pool->order);
> > > > >>>>> Hui what? What should that be good for?
> > > > >>>> This is a carryover from the ION page pool implementation:
> > > > >>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Ftree%2Fdrivers%2Fstaging%2Fandroid%2Fion%2Fion_page_pool.c%3Fh%3Dv5.10%23n28data=04%7C01%7Cchristian.koenig%40amd.com%7Cdff8edcd4d147a5b08d8cd20cff2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637484888114923580%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=9%2BIBC0tezSV6Ci4S3kWfW%2BQvJm4mdunn3dF6C0kyfCw%3Dreserved=0
> > > > >>>>
> > > > >>>>
> > > > >>>> My sense is it helps with the vmstat/meminfo accounting so folks 
> > > > >>>> can
> > > > >>>> see the cached pages are shrinkable/freeable. This maybe falls 
> > > > >>>> under
> > > > >>>> other dmabuf accounting/stats discussions, so I'm happy to remove 
> > > > >>>> it
> > > > >>>> for now, or let the drivers using the shared page pool logic handle
> > > > >>>> the accounting themselves?
> > > > >> Intentionally separated the discussion for that here.
> > > > >>
> > > > >> As far as I can see this is just bluntly incorrect.
> > > > >>
> > > > >> Either the page is reclaimable or it is part of our pool and freeable
> > > > >> through the shrinker, but never ever both.
> > > > > IIRC the original motivation for counting ION pooled pages as
> > > > > reclaimable was to include them into /proc/meminfo's MemAvailable
> > > > > calculations. NR_KERNEL_MISC_RECLAIMABLE defined as "reclaimable
> > > > > non-slab kernel pages" seems like a good place to account for them but
> > > > > I might be wrong.
> > > >
> > > > Yeah, that's what I see here as well. But exactly that is utterly 
> > > > nonsense.
> > > >
> > > > Those pages are not "free" in the sense that get_free_page could return
> > > > them directly.
> > >
> > > Well on Android that is kinda true, because Android has its
> > > oom-killer (way back it was just a shrinker callback, not sure how it
> > > works now), which just shot down all the background apps. So at least
> > > some of that (everything used by background apps) is indeed
> > > reclaimable on Android.
> > >
> > > But that doesn't hold on Linux in general, so we can't really do this
> > > for common code.
> > >
> > > Also I had a long meeting with Suren, John and other googles
> > > yesterday, and the aim is now to try and support all the Android gpu
> > > memory accounting needs with cgroups. That should work, and it will
> > > allow Android to handle all the Android-ism in a clean way in upstream
> > > code. Or that's at least the plan.
> > 

Re: [RFC][PATCH v6 1/7] drm: Add a sharable drm page-pool implementation

2021-02-09 Thread Suren Baghdasaryan
On Tue, Feb 9, 2021 at 12:03 PM Daniel Vetter  wrote:
>
> On Tue, Feb 9, 2021 at 6:46 PM Christian König  
> wrote:
> >
> >
> >
> > Am 09.02.21 um 18:33 schrieb Suren Baghdasaryan:
> > > On Tue, Feb 9, 2021 at 4:57 AM Christian König  
> > > wrote:
> > >> Am 09.02.21 um 13:11 schrieb Christian König:
> > >>> [SNIP]
> > >>>>>> +void drm_page_pool_add(struct drm_page_pool *pool, struct page 
> > >>>>>> *page)
> > >>>>>> +{
> > >>>>>> + spin_lock(&pool->lock);
> > >>>>>> + list_add_tail(&page->lru, &pool->items);
> > >>>>>> + pool->count++;
> > >>>>>> + atomic_long_add(1 << pool->order, &total_pages);
> > >>>>>> + spin_unlock(&pool->lock);
> > >>>>>> +
> > >>>>>> + mod_node_page_state(page_pgdat(page),
> > >>>>>> NR_KERNEL_MISC_RECLAIMABLE,
> > >>>>>> + 1 << pool->order);
> > >>>>> Hui what? What should that be good for?
> > >>>> This is a carryover from the ION page pool implementation:
> > >>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Ftree%2Fdrivers%2Fstaging%2Fandroid%2Fion%2Fion_page_pool.c%3Fh%3Dv5.10%23n28data=04%7C01%7Cchristian.koenig%40amd.com%7Cdff8edcd4d147a5b08d8cd20cff2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637484888114923580%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=9%2BIBC0tezSV6Ci4S3kWfW%2BQvJm4mdunn3dF6C0kyfCw%3Dreserved=0
> > >>>>
> > >>>>
> > >>>> My sense is it helps with the vmstat/meminfo accounting so folks can
> > >>>> see the cached pages are shrinkable/freeable. This maybe falls under
> > >>>> other dmabuf accounting/stats discussions, so I'm happy to remove it
> > >>>> for now, or let the drivers using the shared page pool logic handle
> > >>>> the accounting themselves?
> > >> Intentionally separated the discussion for that here.
> > >>
> > >> As far as I can see this is just bluntly incorrect.
> > >>
> > >> Either the page is reclaimable or it is part of our pool and freeable
> > >> through the shrinker, but never ever both.
> > > IIRC the original motivation for counting ION pooled pages as
> > > reclaimable was to include them into /proc/meminfo's MemAvailable
> > > calculations. NR_KERNEL_MISC_RECLAIMABLE defined as "reclaimable
> > > non-slab kernel pages" seems like a good place to account for them but
> > > I might be wrong.
> >
> > Yeah, that's what I see here as well. But exactly that is utterly nonsense.
> >
> > Those pages are not "free" in the sense that get_free_page could return
> > them directly.
>
> Well on Android that is kinda true, because Android has its
> oom-killer (way back it was just a shrinker callback, not sure how it
> works now), which just shot down all the background apps. So at least
> some of that (everything used by background apps) is indeed
> reclaimable on Android.
>
> But that doesn't hold on Linux in general, so we can't really do this
> for common code.
>
> Also I had a long meeting with Suren, John and other googles
> yesterday, and the aim is now to try and support all the Android gpu
> memory accounting needs with cgroups. That should work, and it will
> allow Android to handle all the Android-ism in a clean way in upstream
> code. Or that's at least the plan.
>
> I think the only thing we identified that Android still needs on top
> is the dma-buf sysfs stuff, so that shared buffers (which on Android
> are always dma-buf, and always stay around as dma-buf fd throughout
> their lifetime) can be listed/analyzed with full detail.
>
> But aside from this the plan for all the per-process or per-heap
> account, oom-killer integration and everything else is planned to be
> done with cgroups.

Until cgroups are ready we probably will need to add a sysfs node to
report the total dmabuf pool size and I think that would cover our
current accounting need here.
As I mentioned, not including dmabuf pools into MemAvailable would
affect that stat and I'm wondering if pools should be considered as
part of MemAvailable or not. Since MemAvailable includes SReclaimable
I think it makes sense to include them but maybe there are other
considerations that I'm missing?

> Android (for now) only needs to account overall gpu
> memory since none of it is swappable on android drivers anyway, plus
> no vram, so not much needed.
>
> Cheers, Daniel
>
> >
> > Regards,
> > Christian.
> >
> > >
> > >> In the best case this just messes up the accounting, in the worst case
> > >> it can cause memory corruption.
> > >>
> > >> Christian.
> >
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


Re: [RFC][PATCH v6 1/7] drm: Add a sharable drm page-pool implementation

2021-02-09 Thread Suren Baghdasaryan
On Tue, Feb 9, 2021 at 9:46 AM Christian König  wrote:
>
>
>
> Am 09.02.21 um 18:33 schrieb Suren Baghdasaryan:
> > On Tue, Feb 9, 2021 at 4:57 AM Christian König  
> > wrote:
> >> Am 09.02.21 um 13:11 schrieb Christian König:
> >>> [SNIP]
> >>>>>> +void drm_page_pool_add(struct drm_page_pool *pool, struct page *page)
> >>>>>> +{
> >>>>>> + spin_lock(&pool->lock);
> >>>>>> + list_add_tail(&page->lru, &pool->items);
> >>>>>> + pool->count++;
> >>>>>> + atomic_long_add(1 << pool->order, &total_pages);
> >>>>>> + spin_unlock(&pool->lock);
> >>>>>> +
> >>>>>> + mod_node_page_state(page_pgdat(page),
> >>>>>> NR_KERNEL_MISC_RECLAIMABLE,
> >>>>>> + 1 << pool->order);
> >>>>> Hui what? What should that be good for?
> >>>> This is a carryover from the ION page pool implementation:
> >>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Ftree%2Fdrivers%2Fstaging%2Fandroid%2Fion%2Fion_page_pool.c%3Fh%3Dv5.10%23n28data=04%7C01%7Cchristian.koenig%40amd.com%7Cdff8edcd4d147a5b08d8cd20cff2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637484888114923580%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=9%2BIBC0tezSV6Ci4S3kWfW%2BQvJm4mdunn3dF6C0kyfCw%3Dreserved=0
> >>>>
> >>>>
> >>>> My sense is it helps with the vmstat/meminfo accounting so folks can
> >>>> see the cached pages are shrinkable/freeable. This maybe falls under
> >>>> other dmabuf accounting/stats discussions, so I'm happy to remove it
> >>>> for now, or let the drivers using the shared page pool logic handle
> >>>> the accounting themselves?
> >> Intentionally separated the discussion for that here.
> >>
> >> As far as I can see this is just bluntly incorrect.
> >>
> >> Either the page is reclaimable or it is part of our pool and freeable
> >> through the shrinker, but never ever both.
> > IIRC the original motivation for counting ION pooled pages as
> > reclaimable was to include them into /proc/meminfo's MemAvailable
> > calculations. NR_KERNEL_MISC_RECLAIMABLE defined as "reclaimable
> > non-slab kernel pages" seems like a good place to account for them but
> > I might be wrong.
>
> Yeah, that's what I see here as well. But exactly that is utterly nonsense.
>
> Those pages are not "free" in the sense that get_free_page could return
> them directly.

Any ideas where these pages would fit better? We do want to know that
under memory pressure these pages can be made available (which is I
think what MemAvailable means).
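
For what it's worth, the counter does feed MemAvailable today - below is
a paraphrased sketch of si_mem_available() in mm/page_alloc.c (v5.10-era),
where the pagecache/slab helpers are illustrative stand-ins for the
inline discounting arithmetic, not real kernel symbols:

/*
 * Paraphrased sketch of mm/page_alloc.c:si_mem_available(), v5.10-era;
 * not verbatim. reclaimable_pagecache() and reclaimable_slab() are
 * stand-ins for the inline discounting arithmetic.
 */
long si_mem_available_sketch(void)
{
	long available;

	/* free pages minus reserves the kernel will not hand out */
	available = global_zone_page_state(NR_FREE_PAGES) - totalreserve_pages;

	/* plus discounted reclaimable page cache and slab */
	available += reclaimable_pagecache();
	available += reclaimable_slab();

	/* plus "reclaimable non-slab kernel pages" - the bucket the pool
	 * accounting above feeds */
	available += global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE);

	return available < 0 ? 0 : available;
}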

>
> Regards,
> Christian.
>
> >
> >> In the best case this just messes up the accounting, in the worst case
> >> it can cause memory corruption.
> >>
> >> Christian.
>


Re: [RFC][PATCH v6 1/7] drm: Add a sharable drm page-pool implementation

2021-02-09 Thread Suren Baghdasaryan
On Tue, Feb 9, 2021 at 4:57 AM Christian König  wrote:
>
> Am 09.02.21 um 13:11 schrieb Christian König:
> > [SNIP]
>  +void drm_page_pool_add(struct drm_page_pool *pool, struct page *page)
>  +{
>  + spin_lock(&pool->lock);
>  + list_add_tail(&page->lru, &pool->items);
>  + pool->count++;
>  + atomic_long_add(1 << pool->order, &total_pages);
>  + spin_unlock(&pool->lock);
>  +
>  + mod_node_page_state(page_pgdat(page),
>  NR_KERNEL_MISC_RECLAIMABLE,
>  + 1 << pool->order);
> >>> Hui what? What should that be good for?
> >> This is a carryover from the ION page pool implementation:
> >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Ftree%2Fdrivers%2Fstaging%2Fandroid%2Fion%2Fion_page_pool.c%3Fh%3Dv5.10%23n28data=04%7C01%7Cchristian.koenig%40amd.com%7Cc4eadb0a9cf6491d99ba08d8ca173457%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637481548325174885%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=FUjZK5NSDMUYfU7vGeE4fDU2HCF%2FYyNBwc30aoLLPQ4%3Dreserved=0
> >>
> >>
> >> My sense is it helps with the vmstat/meminfo accounting so folks can
> >> see the cached pages are shrinkable/freeable. This maybe falls under
> >> other dmabuf accounting/stats discussions, so I'm happy to remove it
> >> for now, or let the drivers using the shared page pool logic handle
> >> the accounting themselves?
>
> Intentionally separated the discussion for that here.
>
> As far as I can see this is just bluntly incorrect.
>
> Either the page is reclaimable or it is part of our pool and freeable
> through the shrinker, but never ever both.

IIRC the original motivation for counting ION pooled pages as
reclaimable was to include them into /proc/meminfo's MemAvailable
calculations. NR_KERNEL_MISC_RECLAIMABLE defined as "reclaimable
non-slab kernel pages" seems like a good place to account for them but
I might be wrong.

>
> In the best case this just messes up the accounting, in the worst case
> it can cause memory corruption.
>
> Christian.


Re: [RFC][PATCH v6 1/7] drm: Add a sharable drm page-pool implementation

2021-02-05 Thread Suren Baghdasaryan
On Fri, Feb 5, 2021 at 12:47 PM John Stultz  wrote:
>
> On Fri, Feb 5, 2021 at 12:47 AM Christian König
>  wrote:
> > Am 05.02.21 um 09:06 schrieb John Stultz:
> > > diff --git a/drivers/gpu/drm/page_pool.c b/drivers/gpu/drm/page_pool.c
> > > new file mode 100644
> > > index ..2139f86e6ca7
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/page_pool.c
> > > @@ -0,0 +1,220 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> >
> > Please use a BSD/MIT compatible license if you want to copy this from
> > the TTM code.
>
> Hrm. This may be problematic, as it's not just TTM code, but some of
> the TTM logic integrated into a page-pool implementation I wrote based
> on logic from the ION code (which was GPL-2.0 before it was dropped).
> So I don't think I can just make it MIT.  Any extra context on the
> need for MIT, or suggestions on how to best resolve this?
>
> > > +int drm_page_pool_get_size(struct drm_page_pool *pool)
> > > +{
> > > + int ret;
> > > +
> > > + spin_lock(&pool->lock);
> > > + ret = pool->count;
> > > + spin_unlock(&pool->lock);
> >
> > Maybe use an atomic for the count instead?
> >
>
> I can do that, but am curious as to the benefit? We are mostly using
> count where we already have to take the pool->lock anyway, and this
> drm_page_pool_get_size() is only used for debugfs output so far, so I
> don't expect it to be a hot path.
>
>
> > > +void drm_page_pool_add(struct drm_page_pool *pool, struct page *page)
> > > +{
> > > + spin_lock(&pool->lock);
> > > + list_add_tail(&page->lru, &pool->items);
> > > + pool->count++;
> > > + atomic_long_add(1 << pool->order, &total_pages);
> > > + spin_unlock(&pool->lock);
> > > +
> > > + mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
> > > + 1 << pool->order);
> >
> > Hui what? What should that be good for?
>
> This is a carryover from the ION page pool implementation:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/staging/android/ion/ion_page_pool.c?h=v5.10#n28
>
> My sense is it helps with the vmstat/meminfo accounting so folks can
> see the cached pages are shrinkable/freeable. This maybe falls under
> other dmabuf accounting/stats discussions, so I'm happy to remove it
> for now, or let the drivers using the shared page pool logic handle
> the accounting themselves?

Yep, ION pools were accounted for as reclaimable kernel memory because
they could be dropped when the system is under memory pressure.

>
>
> > > +static struct page *drm_page_pool_remove(struct drm_page_pool *pool)
> > > +{
> > > + struct page *page;
> > > +
> > > + if (!pool->count)
> > > + return NULL;
> >
> > Better use list_first_entry_or_null instead of checking the count.
> >
> > This way you can also pull the lock into the function.
>
> Yea, that cleans a number of things up nicely. Thank you!
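>
> Roughly like this, I think (sketch, untested):
>
>         static struct page *drm_page_pool_remove(struct drm_page_pool *pool)
>         {
>                 struct page *page;
>
>                 spin_lock(&pool->lock);
>                 page = list_first_entry_or_null(&pool->items,
>                                                 struct page, lru);
>                 if (page) {
>                         list_del(&page->lru);
>                         pool->count--;
>                         atomic_long_sub(1 << pool->order, &total_pages);
>                 }
>                 spin_unlock(&pool->lock);
>
>                 return page;
>         }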
>
>
> > > +struct drm_page_pool *drm_page_pool_create(unsigned int order,
> > > +   int (*free_page)(struct page *p, unsigned int order))
> > > +{
> > > + struct drm_page_pool *pool = kmalloc(sizeof(*pool), GFP_KERNEL);
> >
> > Why not making this an embedded object? We should not see much dynamic
> > pool creation.
>
> Yea, it felt cleaner at the time this way, but I think I will need to
> switch to an embedded object in order to resolve the memory usage
> issue you pointed out with growing the ttm_pool_dma, so thank you for
> the suggestion!
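>
> Something along these lines, I imagine (rough sketch; the init helper
> and the pool->free member name are illustrative, not final):
>
>         /* caller embeds the pool and initializes it in place */
>         void drm_page_pool_init(struct drm_page_pool *pool, unsigned int order,
>                                 int (*free_page)(struct page *p, unsigned int order))
>         {
>                 spin_lock_init(&pool->lock);
>                 INIT_LIST_HEAD(&pool->items);
>                 pool->count = 0;
>                 pool->order = order;
>                 pool->free = free_page;
>
>                 /* register with the global shrinker list */
>                 mutex_lock(&pool_list_lock);
>                 list_add(&pool->list, &pool_list);
>                 mutex_unlock(&pool_list_lock);
>         }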
>
>
> > > +void drm_page_pool_destroy(struct drm_page_pool *pool)
> > > +{
> > > + struct page *page;
> > > +
> > > + /* Remove us from the pool list */
> > > + mutex_lock(&pool_list_lock);
> > > + list_del(&pool->list);
> > > + mutex_unlock(&pool_list_lock);
> > > +
> > > + /* Free any remaining pages in the pool */
> > > + spin_lock(&pool->lock);
> >
> > Locking should be unnecessary when the pool is destroyed anyway.
>
> I guess if we've already pruned ourselves from the pool list, then
> you're right, we can't race with the shrinker and it's maybe not necessary.
> But it also seems easier to consistently follow the locking rules in a
> very unlikely path rather than leaning on subtlety.  Either way, I
> think this becomes moot if I make the improvements you suggest to
> drm_page_pool_remove().
>
> > > +static int drm_page_pool_shrink_one(void)
> > > +{
> > > + struct drm_page_pool *pool;
> > > + struct page *page;
> > > + int nr_freed = 0;
> > > +
> > > + mutex_lock(&pool_list_lock);
> > > + pool = list_first_entry(&pool_list, typeof(*pool), list);
> > > +
> > > + spin_lock(&pool->lock);
> > > + page = drm_page_pool_remove(pool);
> > > + spin_unlock(&pool->lock);
> > > +
> > > + if (page)
> > > + nr_freed = drm_page_pool_free_pages(pool, page);
> > > +
> > > + list_move_tail(&pool->list, &pool_list);
> >
> > Better to move this up, directly after the list_first_entry().
>
> Sounds good!
>
> Thanks so much for your review and feedback! I'll try to get some of
> the easy suggestions 

Re: [PATCH] mm: cma: support sysfs

2021-02-05 Thread Suren Baghdasaryan
On Fri, Feb 5, 2021 at 1:28 PM Minchan Kim  wrote:
>
> On Fri, Feb 05, 2021 at 12:25:52PM -0800, John Hubbard wrote:
> > On 2/5/21 8:15 AM, Minchan Kim wrote:
> > ...
> > > > Yes, approximately. I was wondering if this would suffice at least as a 
> > > > baseline:
> > > >
> > > > cma_alloc_success   125
> > > > cma_alloc_failure   25
> > >
> > > IMO, regardless of my patch, it would be good to have such statistics,
> > > in that CMA was born to replace carved-out memory with dynamic
> > > allocation, ideally for memory efficiency, so failures should be
> > > regarded as critical and admins should be able to notice how the
> > > system is hurt.
> >
> > Right. So CMA failures are useful for the admin to see, understood.
> >
> > >
> > > Anyway, it's not enough for me and orthogonal to my goal.
> > >
> >
> > OK. But...what *is* your goal, and why is this useless (that's what
> > orthogonal really means here) for your goal?
>
> As I mentioned, the goal is to monitor the failures from each CMA
> since they each have their own purpose.
>
> Let's have an example.
>
> The system has 5 CMA areas and each CMA is associated with its own
> user scenario. They have exclusive CMA areas to avoid
> fragmentation problems.
>
> CMA-1 depends on bluetooth
> CMA-2 depends on WIFI
> CMA-3 depends on sensor-A
> CMA-4 depends on sensor-B
> CMA-5 depends on sensor-C
>
> With this, we could catch which module was affected, but with only a
> global failure count I couldn't find who was affected.
>
> >
> > Also, would you be willing to try out something simple first,
> > such as providing indication that cma is active and it's overall success
> > rate, like this:
> >
> > /proc/vmstat:
> >
> > cma_alloc_success   125
> > cma_alloc_failure   25
> >
> > ...or is the only way to provide the more detailed items, complete with
> > per-CMA details, in a non-debugfs location?
> >
> >
> > > >
> > > > ...and then, to see if more is needed, some questions:
> > > >
> > > > a)  Do you know of an upper bound on how many cma areas there can be
> > > > (I think Matthew also asked that)?
> > >
> > > There is no upper bound since it's configurable.
> > >
> >
> > OK, thanks,so that pretty much rules out putting per-cma details into
> > anything other than a directory or something like it.
> >
> > > >
> > > > b) Is tracking the cma area really as valuable as other possibilities? 
> > > > We can put
> > > > "a few" to "several" items here, so really want to get your very 
> > > > favorite bits of
> > > > information in. If, for example, there can be *lots* of cma areas, then 
> > > > maybe tracking
> > >
> > > At this moment, allocation/failure counts for each CMA area, since
> > > they each have their own particular usecase, which makes it easy for
> > > me to track which module will be affected. I think per-CMA statistics
> > > are very useful as a minimal code change, so I want to enable them by
> > > default under CONFIG_CMA && CONFIG_SYSFS.
> > >
> > > > by a range of allocation sizes is better...
> > >
> > > I take your suggestion to be something like this.
> > >
> > > [alloc_range] could be order or range by interval
> > >
> > > /sys/kernel/mm/cma/cma-A/[alloc_range]/success
> > > /sys/kernel/mm/cma/cma-A/[alloc_range]/fail
> > > ..
> > > ..
> > > /sys/kernel/mm/cma/cma-Z/[alloc_range]/success
> > > /sys/kernel/mm/cma/cma-Z/[alloc_range]/fail

The interface above seems to me the most useful actually, if by
[alloc_range] you mean the different allocation orders. This would
cover Minchan's per-CMA failure tracking and would also allow us to
understand what kind of allocations are failing and therefore if the
problem is caused by pinning/fragmentation or by over-utilization.
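
For example, if [alloc_range] is the allocation order, the layout could
look something like this (hypothetical naming):

/sys/kernel/mm/cma/cma-A/order_0/success
/sys/kernel/mm/cma/cma-A/order_0/fail
/sys/kernel/mm/cma/cma-A/order_4/success
/sys/kernel/mm/cma/cma-A/order_4/fail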

> >
> > Actually, I meant, "ranges instead of cma areas", like this:
> >
> > /sys/kernel/mm/cma/[alloc_range]/success
> > /sys/kernel/mm/cma/[alloc_range]/fail
> > ...
> >
> > The idea is that knowing the allocation sizes that succeeded
> > and failed is maybe even more interesting and useful than
> > knowing the cma area that contains them.
>
> I understand your point but it would make it hard to find who was
> affected by the failure. That's why I suggested putting your
> suggestion under an additional config, since per-CMA metrics with
> simple success/failure counts are enough.
>
> >
> > >
> > > I agree it would also be useful but I'd like to enable it under
> > > CONFIG_CMA_SYSFS_ALLOC_RANGE as a separate patchset.
> > >
> >
> > I will stop harassing you very soon, just want to bottom out on
> > understanding the real goals first. :)
> >
>
> I hope my example makes the goal more clear for you.


Re: [Linaro-mm-sig] [PATCH 1/2] mm: replace BUG_ON in vm_insert_page with a return of an error

2021-02-04 Thread Suren Baghdasaryan
On Thu, Feb 4, 2021 at 7:55 AM Alex Deucher  wrote:
>
> On Thu, Feb 4, 2021 at 3:16 AM Christian König  
> wrote:
> >
> > > On 03.02.21 at 22:41, Suren Baghdasaryan wrote:
> > > [SNIP]
> > >>> How many semi-unrelated buffer accounting schemes does google come up 
> > >>> with?
> > >>>
> > >>> We're at three with this one.
> > >>>
> > >>> And also we _cannot_ require that all dma-bufs are backed by struct
> > >>> page, so requiring struct page to make this work is a no-go.
> > >>>
> > >>> Second, we do not want to allow get_user_pages and friends to work on
> > >>> dma-buf, it causes all kinds of pain. Yes on SoC where dma-buf are
> > >>> exclusively in system memory you can maybe get away with this, but
> > >>> dma-buf is supposed to work in more places than just Android SoCs.
> > >> I just realized that vm_insert_page doesn't even work for CMA, it would
> > >> upset get_user_pages pretty badly - you're trying to pin a page in
> > >> ZONE_MOVABLE but you can't move it because it's rather special.
> > >> VM_SPECIAL is exactly meant to catch this stuff.
> > > Thanks for the input, Daniel! Let me think about the cases you pointed 
> > > out.
> > >
> > > IMHO, the issue with PSS is the difficulty of calculating this metric
> > > without struct page usage. I don't think that problem becomes easier
> > > if we use cgroups or any other API. I wanted to enable existing PSS
> > > calculation mechanisms for the dmabufs known to be backed by struct
> > > pages (since we know how the heap allocated that memory), but sounds
> > > like this would lead to problems that I did not consider.
> >
> > Yeah, using struct page indeed won't work. We discussed that multiple
> > times now and Daniel even has a patch to mangle the struct page pointers
> > inside the sg_table object to prevent abuse in that direction.
> >
> > On the other hand I totally agree that we need to do something on this
> > side which goes beyond what cgroups provide.
> >
> > A few years ago I came up with patches to improve the OOM killer to
> > include resources bound to the processes through file descriptors. I
> > unfortunately can't find them offhand any more and I'm currently too busy
> > to dig them up.
>
> https://lists.freedesktop.org/archives/dri-devel/2015-September/089778.html
> I think there was a more recent discussion, but I can't seem to find it.

Thanks for the pointer!
Appreciate the time everyone took to explain the issues.
Thanks,
Suren.

>
> Alex
>
> >
> > In general I think we need to make it possible that both the in-kernel
> > OOM killer as well as userspace processes and handlers have access to
> > that kind of data.
> >
> > The fdinfo approach as suggested in the other thread sounds like the
> > easiest solution to me.
> >
> > Regards,
> > Christian.
> >
> > > Thanks,
> > > Suren.
> > >
> > >
> >


Re: [PATCH] mm: cma: support sysfs

2021-02-04 Thread Suren Baghdasaryan
On Thu, Feb 4, 2021 at 5:44 PM Minchan Kim  wrote:
>
> On Thu, Feb 04, 2021 at 04:24:20PM -0800, John Hubbard wrote:
> > On 2/4/21 4:12 PM, Minchan Kim wrote:
> > ...
> > > > > Then, how to know how often CMA API failed?
> > > >
> > > > Why would you even need to know that, *in addition* to knowing specific
> > > > page allocation numbers that failed? Again, there is no real-world 
> > > > motivation
> > > > cited yet, just "this is good data". Need more stories and support here.
> > >
> > > Let me give an example.
> > >
> > > Let's assume we use memory buffer allocation via CMA for enabling
> > > bluetooth on the device.
> > > If the user clicks the bluetooth button on the phone but the memory
> > > allocation from CMA fails, the user will still see the bluetooth
> > > button gray. The user would think their touch was not powerful enough,
> > > so they try clicking again; fortunately the CMA allocation succeeds
> > > this time, and they see the bluetooth button enabled and can listen
> > > to music.
> > >
> > > Here, the product team needs to monitor how often CMA allocation
> > > fails, so that if the failure ratio steadily rises above the bar,
> > > it means engineers need to go investigate.
> > >
> > > Make sense?
> > >
> >
> > Yes, except that it raises more questions:
> >
> > 1) Isn't this just standard allocation failure? Don't you already have a way
> > to track that?
> >
> > Presumably, having the source code, you can easily deduce that a bluetooth
> > allocation failure goes directly to a CMA allocation failure, right?
> >
> > Anyway, even though the above is still a little murky, I expect you're right
> > that it's good to have *some* indication, somewhere about CMA behavior...
> >
> > Thinking about this some more, I wonder if this is really /proc/vmstat sort
> > of data that we're talking about. It seems to fit right in there, yes?
>
> The thing is, there are multiple CMA instances (cma-A, cma-B, cma-C)
> and each CMA heap has its own specific scenario. /proc/vmstat could get
> bloated a lot as the number of CMA instances increases.

Oh, I missed the fact that you need these stats per-CMA.


Re: [PATCH] mm: cma: support sysfs

2021-02-04 Thread Suren Baghdasaryan
On Thu, Feb 4, 2021 at 4:34 PM John Hubbard  wrote:
>
> On 2/4/21 4:25 PM, John Hubbard wrote:
> > On 2/4/21 3:45 PM, Suren Baghdasaryan wrote:
> > ...
> >>>>>> 2) The overall CMA allocation attempts/failures (first two items 
> >>>>>> above) seem
> >>>>>> an odd pair of things to track. Maybe that is what was easy to track, 
> >>>>>> but I'd
> >>>>>> vote for just omitting them.
> >>>>>
> >>>>> Then, how to know how often CMA API failed?
> >>>>
> >>>> Why would you even need to know that, *in addition* to knowing specific
> >>>> page allocation numbers that failed? Again, there is no real-world 
> >>>> motivation
> >>>> cited yet, just "this is good data". Need more stories and support here.
> >>>
> >>> IMHO it would be very useful to see whether there are multiple
> >>> small-order allocation failures or a few large-order ones, especially
> >>> for CMA where large allocations are not unusual. For that I believe
> >>> both alloc_pages_attempt and alloc_pages_fail would be required.
> >>
> >> Sorry, I meant to say "both cma_alloc_fail and alloc_pages_fail would
> >> be required".
> >
> > So if you want to know that, the existing items are still a little too 
> > indirect
> > to really get it right. You can only know the average allocation size, by
> > dividing. Instead, we should provide the allocation size, for each count.
> >
> > The limited interface makes this a little awkward, but using zones/ranges 
> > could
> > work: "for this range of allocation sizes, there were the following stats". 
> > Or,
> > some other technique that I haven't thought of (maybe two items per file?) 
> > would
> > be better.
> >
> > On the other hand, there's an argument for keeping this minimal and simple. 
> > That
> > would probably lead us to putting in a couple of items into /proc/vmstat, 
> > as I
> > just mentioned in my other response, and calling it good.

True. I was thinking along these lines but per-order counters felt
like maybe overkill? I'm all for keeping it simple.

>

> ...and remember: if we keep it nice and minimal and clean, we can put it into
> /proc/vmstat and monitor it.

No objections from me.

>
> And then if a problem shows up, the more complex and advanced debugging data 
> can
> go into debugfs's CMA area. And you're all set.
>
> If Android made up some policy not to use debugfs, then:
>
> a) that probably won't prevent engineers from using it anyway, for advanced 
> debugging,
> and
>
> b) If (a) somehow falls short, then we need to talk about what Android's 
> plans are to
> fill the need. And "fill up sysfs with debugfs items, possibly duplicating 
> some of them,
> and generally making an unnecessary mess, to compensate for not using debugfs" 
> is not
> my first choice. :)
>
>
> thanks,
> --
> John Hubbard
> NVIDIA


Re: [PATCH] mm: cma: support sysfs

2021-02-04 Thread Suren Baghdasaryan
On Thu, Feb 4, 2021 at 3:43 PM Suren Baghdasaryan  wrote:
>
> On Thu, Feb 4, 2021 at 3:14 PM John Hubbard  wrote:
> >
> > On 2/4/21 12:07 PM, Minchan Kim wrote:
> > > On Thu, Feb 04, 2021 at 12:50:58AM -0800, John Hubbard wrote:
> > >> On 2/3/21 7:50 AM, Minchan Kim wrote:
> > >>> Since CMA is getting used more widely, it's more important to
> > >>> keep monitoring CMA statistics for system health since it's
> > >>> directly related to user experience.
> > >>>
> > >>> This patch introduces sysfs for the CMA and exposes stats below
> > >>> to keep monitoring telemetry in the system.
> > >>>
> > >>>* the number of CMA allocation attempts
> > >>>* the number of CMA allocation failures
> > >>>* the number of CMA page allocation attempts
> > >>>* the number of CMA page allocation failures
> > >>
> > >> The desire to report CMA data is understandable, but there are a few
> > >> odd things here:
> > >>
> > >> 1) First of all, this has significant overlap with /sys/kernel/debug/cma
> > >> items. I suspect that all of these items could instead go into
> > >
> > > At this moment, I don't see any overlap with items from cma_debugfs.
> > > Could you specify what item you are mentioning?
> >
> > Just the fact that there would be two systems under /sys, both of which are
> > doing very very similar things: providing information that is intended to
> > help diagnose CMA.
> >
> > >
> > >> /sys/kernel/debug/cma, right?
> > >
> > > Anyway, the thing is I need a stable interface for that and need to
> > > use it in the Android production build, too (unfortunately, Android
> > > deprecated debugfs:
> > > https://source.android.com/setup/start/android-11-release#debugfs
> > > )
> >
> > That's the closest hint to a "why this is needed" that we've seen yet.
> > But it's only a hint.
> >
> > >
> > > What should be in debugfs and in sysfs? What's the criteria?
> >
> > Well, it's a gray area. "Debugging support" goes into debugfs, and
> > "production-level monitoring and control" goes into sysfs, roughly
> > speaking. And here you have items that could be classified as either.
> >
> > >
> > > Some statistics could be considered either a debugging aid or
> > > telemetry depending on viewpoint and usecase. And here, I want to use
> > > it for telemetry, get a stable interface and use it in the production
> > > build of Android. While at it, I'd like to get a concrete guideline on
> > > what should go in sysfs vs. debugfs, so that this thread can be
> > > pointed out whenever folks dump their stats into sysfs, to avoid
> > > wasting others' time in the future. :)
> > >
> > >>
> > >> 2) The overall CMA allocation attempts/failures (first two items above) 
> > >> seem
> > >> an odd pair of things to track. Maybe that is what was easy to track, 
> > >> but I'd
> > >> vote for just omitting them.
> > >
> > > Then, how to know how often CMA API failed?
> >
> > Why would you even need to know that, *in addition* to knowing specific
> > page allocation numbers that failed? Again, there is no real-world 
> > motivation
> > cited yet, just "this is good data". Need more stories and support here.
>
> IMHO it would be very useful to see whether there are multiple
> small-order allocation failures or a few large-order ones, especially
> for CMA where large allocations are not unusual. For that I believe
> both alloc_pages_attempt and alloc_pages_fail would be required.

Sorry, I meant to say "both cma_alloc_fail and alloc_pages_fail would
be required".

>
> >
> >
> > thanks,
> > --
> > John Hubbard
> > NVIDIA
> >
> > > There are allocation requests of various sizes for a CMA, so page
> > > allocation stats alone are not enough to know it.
> > >
> > >>>
> > >>> Signed-off-by: Minchan Kim 
> > >>> ---
> > >>>Documentation/ABI/testing/sysfs-kernel-mm-cma |  39 +
> > >>>include/linux/cma.h   |   1 +
> > >>>mm/Makefile   |   1 +
> > >>>mm/cma.c  |   6 +-
> > >>>mm/cma.h  |  20 +++

Re: [PATCH] mm: cma: support sysfs

2021-02-04 Thread Suren Baghdasaryan
On Thu, Feb 4, 2021 at 3:14 PM John Hubbard  wrote:
>
> On 2/4/21 12:07 PM, Minchan Kim wrote:
> > On Thu, Feb 04, 2021 at 12:50:58AM -0800, John Hubbard wrote:
> >> On 2/3/21 7:50 AM, Minchan Kim wrote:
> >>> Since CMA is getting used more widely, it's more important to
> >>> keep monitoring CMA statistics for system health since it's
> >>> directly related to user experience.
> >>>
> >>> This patch introduces sysfs for the CMA and exposes stats below
> >>> to keep monitoring telemetry in the system.
> >>>
> >>>* the number of CMA allocation attempts
> >>>* the number of CMA allocation failures
> >>>* the number of CMA page allocation attempts
> >>>* the number of CMA page allocation failures
> >>
> >> The desire to report CMA data is understandable, but there are a few
> >> odd things here:
> >>
> >> 1) First of all, this has significant overlap with /sys/kernel/debug/cma
> >> items. I suspect that all of these items could instead go into
> >
> > At this moment, I don't see any overlap with items from cma_debugfs.
> > Could you specify what item you are mentioning?
>
> Just the fact that there would be two systems under /sys, both of which are
> doing very very similar things: providing information that is intended to
> help diagnose CMA.
>
> >
> >> /sys/kernel/debug/cma, right?
> >
> > Anyway, the thing is I need a stable interface for that and need to
> > use it in the Android production build, too (unfortunately, Android
> > deprecated debugfs:
> > https://source.android.com/setup/start/android-11-release#debugfs
> > )
>
> That's the closest hint to a "why this is needed" that we've seen yet.
> But it's only a hint.
>
> >
> > What should be in debugfs and in sysfs? What's the criteria?
>
> Well, it's a gray area. "Debugging support" goes into debugfs, and
> "production-level monitoring and control" goes into sysfs, roughly
> speaking. And here you have items that could be classified as either.
>
> >
> > Some statistics could be considered either a debugging aid or
> > telemetry depending on viewpoint and usecase. And here, I want to use
> > it for telemetry, get a stable interface and use it in the production
> > build of Android. While at it, I'd like to get a concrete guideline on
> > what should go in sysfs vs. debugfs, so that this thread can be
> > pointed out whenever folks dump their stats into sysfs, to avoid
> > wasting others' time in the future. :)
> >
> >>
> >> 2) The overall CMA allocation attempts/failures (first two items above) 
> >> seem
> >> an odd pair of things to track. Maybe that is what was easy to track, but 
> >> I'd
> >> vote for just omitting them.
> >
> > Then, how to know how often CMA API failed?
>
> Why would you even need to know that, *in addition* to knowing specific
> page allocation numbers that failed? Again, there is no real-world motivation
> cited yet, just "this is good data". Need more stories and support here.

IMHO it would be very useful to see whether there are multiple
small-order allocation failures or a few large-order ones, especially
for CMA where large allocations are not unusual. For that I believe
both alloc_pages_attempt and alloc_pages_fail would be required.

>
>
> thanks,
> --
> John Hubbard
> NVIDIA
>
> > There are allocation requests of various sizes for a CMA, so page
> > allocation stats alone are not enough to know it.
> >
> >>>
> >>> Signed-off-by: Minchan Kim 
> >>> ---
> >>>Documentation/ABI/testing/sysfs-kernel-mm-cma |  39 +
> >>>include/linux/cma.h   |   1 +
> >>>mm/Makefile   |   1 +
> >>>mm/cma.c  |   6 +-
> >>>mm/cma.h  |  20 +++
> >>>mm/cma_sysfs.c| 143 ++
> >>>6 files changed, 209 insertions(+), 1 deletion(-)
> >>>create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-cma
> >>>create mode 100644 mm/cma_sysfs.c
> >>>
> >>> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-cma 
> >>> b/Documentation/ABI/testing/sysfs-kernel-mm-cma
> >>> new file mode 100644
> >>> index ..2a43c0aacc39
> >>> --- /dev/null
> >>> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-cma
> >>> @@ -0,0 +1,39 @@
> >>> +What:  /sys/kernel/mm/cma/
> >>> +Date:  Feb 2021
> >>> +Contact:   Minchan Kim 
> >>> +Description:
> >>> +   /sys/kernel/mm/cma/ contains a number of subdirectories by
> >>> +   cma-heap name. The subdirectory contains a number of files
> >>> +   to represent cma allocation statistics.
> >>
> >> Somewhere, maybe here, there should be a mention of the closely related
> >> /sys/kernel/debug/cma files.
> >>
> >>> +
> >>> +   There are a number of files under
> >>> +   /sys/kernel/mm/cma/ directory
> >>> +
> >>> +   - cma_alloc_attempt
> >>> +   - cma_alloc_fail
> >>
> >> Are these really useful? They're a summary of 

Re: [Linaro-mm-sig] [PATCH 1/2] mm: replace BUG_ON in vm_insert_page with a return of an error

2021-02-03 Thread Suren Baghdasaryan
On Wed, Feb 3, 2021 at 1:25 PM Daniel Vetter  wrote:
>
> On Wed, Feb 3, 2021 at 9:29 PM Daniel Vetter  wrote:
> >
> > On Wed, Feb 3, 2021 at 9:20 PM Suren Baghdasaryan  wrote:
> > >
> > > On Wed, Feb 3, 2021 at 12:52 AM Daniel Vetter  
> > > wrote:
> > > >
> > > > On Wed, Feb 3, 2021 at 2:57 AM Matthew Wilcox  
> > > > wrote:
> > > > >
> > > > > On Tue, Feb 02, 2021 at 04:31:33PM -0800, Suren Baghdasaryan wrote:
> > > > > > Replace BUG_ON(vma->vm_flags & VM_PFNMAP) in vm_insert_page with
> > > > > > WARN_ON_ONCE and returning an error. This is to ensure users of the
> > > > > > vm_insert_page that set VM_PFNMAP are notified of the wrong flag 
> > > > > > usage
> > > > > > and get an indication of an error without panicing the kernel.
> > > > > > This will help identifying drivers that need to clear VM_PFNMAP 
> > > > > > before
> > > > > > using dmabuf system heap which is moving to use vm_insert_page.
> > > > >
> > > > > NACK.
> > > > >
> > > > > The system may not _panic_, but it is clearly now _broken_.  The 
> > > > > device
> > > > > doesn't work, and so the system is useless.  You haven't really 
> > > > > improved
> > > > > anything here.  Just bloated the kernel with yet another _ONCE 
> > > > > variable
> > > > > that in a normal system will never ever ever be triggered.
> > > >
> > > > Also, what the heck are you doing with your drivers? dma-buf mmap must
> > > > call dma_buf_mmap(), even for forwarded/redirected mmaps from driver
> > > > char nodes. If that doesn't work we have some issues with the calling
> > > > contract for that function, not in vm_insert_page.
> > >
> > > The particular issue I observed (details were posted in
> > > https://lore.kernel.org/patchwork/patch/1372409) is that DRM drivers
> > > set VM_PFNMAP flag (via a call to drm_gem_mmap_obj) before calling
> > > dma_buf_mmap. Some drivers clear that flag but some don't. I could not
> > > find the answer to why VM_PFNMAP is required for dmabuf mappings and
> > > maybe someone can explain that here?
> > > If there is a reason to set this flag other than historical use of
> > > carveout memory then we wanted to catch such cases and fix the drivers
> > > that moved to using dmabuf heaps. However maybe there are other
> > > reasons and if so I would be very grateful if someone could explain
> > > them. That would help me to come up with a better solution.
> > >
> > > > Finally why exactly do we need to make this switch for system heap?
> > > > I've recently looked at gup usage by random drivers, and found a lot
> > > > of worrying things there. gup on dma-buf is really bad idea in
> > > > general.
> > >
> > > The reason for the switch is to be able to account dmabufs allocated
> > > using dmabuf heaps to the processes that map them. The next patch in
> > > this series https://lore.kernel.org/patchwork/patch/1374851
> > > implementing the switch contains more details and there is an active
> > > discussion there. Would you mind joining that discussion to keep it in
> > > one place?
> >
> > How many semi-unrelated buffer accounting schemes does google come up with?
> >
> > We're at three with this one.
> >
> > And also we _cannot_ require that all dma-bufs are backed by struct
> > page, so requiring struct page to make this work is a no-go.
> >
> > Second, we do not want to allow get_user_pages and friends to work on
> > dma-buf, it causes all kinds of pain. Yes on SoC where dma-buf are
> > exclusively in system memory you can maybe get away with this, but
> > dma-buf is supposed to work in more places than just Android SoCs.
>
> I just realized that vm_insert_page doesn't even work for CMA, it would
> upset get_user_pages pretty badly - you're trying to pin a page in
> ZONE_MOVABLE but you can't move it because it's rather special.
> VM_SPECIAL is exactly meant to catch this stuff.

Thanks for the input, Daniel! Let me think about the cases you pointed out.

IMHO, the issue with PSS is the difficulty of calculating this metric
without struct page usage. I don't think that problem becomes easier
if we use cgroups or any other API. I wanted to enable existing PSS
calculation mechanisms for the dmabufs known to be backed by struct
pages (since we know how the heap allocated that memory), but sounds
like this would lead to problems that I did not consider.
Thanks,
Suren.

> -Daniel
>
> > If you want to account dma-bufs, and gpu memory in general, I'd say
> > the solid solution is cgroups. There's patches floating around. And
> > given that Google Android can't even agree internally on what exactly
> > you want I'd say we just need to cut over to that and make it happen.
> >
> > Cheers, Daniel
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


Re: [Linaro-mm-sig] [PATCH 1/2] mm: replace BUG_ON in vm_insert_page with a return of an error

2021-02-03 Thread Suren Baghdasaryan
On Wed, Feb 3, 2021 at 12:52 AM Daniel Vetter  wrote:
>
> On Wed, Feb 3, 2021 at 2:57 AM Matthew Wilcox  wrote:
> >
> > On Tue, Feb 02, 2021 at 04:31:33PM -0800, Suren Baghdasaryan wrote:
> > > Replace BUG_ON(vma->vm_flags & VM_PFNMAP) in vm_insert_page with
> > > WARN_ON_ONCE and returning an error. This is to ensure users of the
> > > vm_insert_page that set VM_PFNMAP are notified of the wrong flag usage
> > > and get an indication of an error without panicing the kernel.
> > > This will help identifying drivers that need to clear VM_PFNMAP before
> > > using dmabuf system heap which is moving to use vm_insert_page.
> >
> > NACK.
> >
> > The system may not _panic_, but it is clearly now _broken_.  The device
> > doesn't work, and so the system is useless.  You haven't really improved
> > anything here.  Just bloated the kernel with yet another _ONCE variable
> > that in a normal system will never ever ever be triggered.
>
> Also, what the heck are you doing with your drivers? dma-buf mmap must
> call dma_buf_mmap(), even for forwarded/redirected mmaps from driver
> char nodes. If that doesn't work we have some issues with the calling
> contract for that function, not in vm_insert_page.

The particular issue I observed (details were posted in
https://lore.kernel.org/patchwork/patch/1372409) is that DRM drivers
set VM_PFNMAP flag (via a call to drm_gem_mmap_obj) before calling
dma_buf_mmap. Some drivers clear that flag but some don't. I could not
find the answer to why VM_PFNMAP is required for dmabuf mappings and
maybe someone can explain that here?
If there is a reason to set this flag other than historical use of
carveout memory then we wanted to catch such cases and fix the drivers
that moved to using dmabuf heaps. However maybe there are other
reasons and if so I would be very grateful if someone could explain
them. That would help me to come up with a better solution.
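
For context, the VM_PFNMAP in question comes from drm_gem_mmap_obj();
quoting the v5.10-era code from memory (so treat as approximate), it
does:

	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;

so any driver that forwards such a mapping to a dmabuf heap inherits
VM_PFNMAP unless it explicitly clears the flag first.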

> Finally why exactly do we need to make this switch for system heap?
> I've recently looked at gup usage by random drivers, and found a lot
> of worrying things there. gup on dma-buf is really bad idea in
> general.

The reason for the switch is to be able to account dmabufs allocated
using dmabuf heaps to the processes that map them. The next patch in
this series https://lore.kernel.org/patchwork/patch/1374851
implementing the switch contains more details and there is an active
discussion there. Would you mind joining that discussion to keep it in
one place?
Thanks!

> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


Re: [PATCH v2 2/2] dma-buf: heaps: Map system heap pages as managed by linux vm

2021-02-03 Thread Suren Baghdasaryan
On Wed, Feb 3, 2021 at 12:06 AM Christian König
 wrote:
>
> > On 03.02.21 at 03:02, Suren Baghdasaryan wrote:
> > On Tue, Feb 2, 2021 at 5:39 PM Minchan Kim  wrote:
> >> On Tue, Feb 02, 2021 at 04:31:34PM -0800, Suren Baghdasaryan wrote:
> >>> Currently system heap maps its buffers with VM_PFNMAP flag using
> >>> remap_pfn_range. This results in such buffers not being accounted
> >>> for in PSS calculations because vm treats this memory as having no
> >>> page structs. Without page structs there are no counters representing
> >>> how many processes are mapping a page and therefore PSS calculation
> >>> is impossible.
> >>> Historically, ION driver used to map its buffers as VM_PFNMAP areas
> >>> due to memory carveouts that did not have page structs [1]. That
> >>> is not the case anymore and it seems there was desire to move away
> >>> from remap_pfn_range [2].
> >>> Dmabuf system heap design inherits this ION behavior and maps its
> >>> pages using remap_pfn_range even though allocated pages are backed
> >>> by page structs.
> >>> Replace remap_pfn_range with vm_insert_page, following Laura's suggestion
> >>> in [1]. This would allow correct PSS calculation for dmabufs.
> >>>
> >>> [1] https://driverdev-devel.linuxdriverproject.narkive.com/v0fJGpaD/using-ion-memory-for-direct-io
> >>> [2] http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2018-October/127519.html
> >>> (sorry, could not find lore links for these discussions)
> >>>
> >>> Suggested-by: Laura Abbott 
> >>> Signed-off-by: Suren Baghdasaryan 
> >> Reviewed-by: Minchan Kim 
> >>
> >> A note: This patch makes the dmabuf system heap accounted as PSS, so
> >> if someone relies on the size, they will see the bloat.
> >> IIRC, there was some debate whether PSS accounting for their
> >> buffer is correct or not. If it'd be a problem, we need to
> >> discuss how to solve it (maybe via vma->vm_flags, and reintroduce
> >> remap_pfn_range so they are respected).
> > I did not see debates about not including *mapped* dmabufs into PSS
> > calculation. I remember people were discussing how to account dmabufs
> > referred only by the FD but that is a different discussion. If the
> > buffer is mapped into the address space of a process then IMHO
> > including it into PSS of that process is not controversial.
>
> Well, I think it is. And to be honest this doesn't look like a good
> idea to me since it will eventually lead to double accounting of system
> heap DMA-bufs.

Thanks for the comment! Could you please expand on this double
accounting issue? Do you mean userspace could double account dmabufs
because it expects dmabufs not to be part of PSS or is there some
in-kernel accounting mechanism that would be broken by this?

>
> As discussed multiple times it is illegal to use the struct page of a
> DMA-buf. This case here is a bit special since it is the owner of the
> pages which does that, but I'm not sure if this won't cause problems
> elsewhere as well.

I would be happy to keep things as they are but calculating dmabuf
contribution to PSS without struct pages is extremely inefficient and
becomes a real pain when we consider the possibilities of partial
mappings, when not the entire dmabuf is being mapped.
Calculating this would require parsing /proc/pid/maps for the process,
finding dmabuf mappings and the size for each one, then parsing
/proc/pid/maps for ALL processes in the system to see if the same
dmabufs are used by other processes and only then calculating the PSS.
I hope that explains the desire to use already existing struct pages
to obtain PSS in a much more efficient way.

>
> A more appropriate solution would be to hold processes accountable for
> resources they have allocated through device drivers.

Are you suggesting some new kernel mechanism to account resources
allocated by a process via a driver? If so, any details?

>
> Regards,
> Christian.
>


Re: [PATCH 1/2] mm: replace BUG_ON in vm_insert_page with a return of an error

2021-02-02 Thread Suren Baghdasaryan
On Tue, Feb 2, 2021 at 5:55 PM Matthew Wilcox  wrote:
>
> On Tue, Feb 02, 2021 at 04:31:33PM -0800, Suren Baghdasaryan wrote:
> > Replace BUG_ON(vma->vm_flags & VM_PFNMAP) in vm_insert_page with
> > WARN_ON_ONCE and returning an error. This is to ensure users of the
> > vm_insert_page that set VM_PFNMAP are notified of the wrong flag usage
> > and get an indication of an error without panicing the kernel.
> > This will help identifying drivers that need to clear VM_PFNMAP before
> > using dmabuf system heap which is moving to use vm_insert_page.
>
> NACK.
>
> The system may not _panic_, but it is clearly now _broken_.  The device
> doesn't work, and so the system is useless.  You haven't really improved
> anything here.  Just bloated the kernel with yet another _ONCE variable
> that in a normal system will never ever ever be triggered.

We had a discussion in https://lore.kernel.org/patchwork/patch/1372409
about how some DRM drivers set up their VMAs with VM_PFNMAP before
mapping them. We want to use vm_insert_page instead of remap_pfn_range
in the dmabuf heaps so that this memory is visible in PSS. However if
a driver that sets VM_PFNMAP tries to use a dmabuf heap, it will step
into this BUG_ON. We wanted to catch and gradually fix such drivers
but without causing a panic in the process. I hope this clarifies the
reasons why I'm making this change and I'm open to other ideas if they
would address this issue in a better way.


Re: [PATCH v2 2/2] dma-buf: heaps: Map system heap pages as managed by linux vm

2021-02-02 Thread Suren Baghdasaryan
On Tue, Feb 2, 2021 at 6:07 PM John Stultz  wrote:
>
> On Tue, Feb 2, 2021 at 4:31 PM Suren Baghdasaryan  wrote:
> > Currently system heap maps its buffers with VM_PFNMAP flag using
> > remap_pfn_range. This results in such buffers not being accounted
> > for in PSS calculations because vm treats this memory as having no
> > page structs. Without page structs there are no counters representing
> > how many processes are mapping a page and therefore PSS calculation
> > is impossible.
> > Historically, ION driver used to map its buffers as VM_PFNMAP areas
> > due to memory carveouts that did not have page structs [1]. That
> > is not the case anymore and it seems there was desire to move away
> > from remap_pfn_range [2].
> > Dmabuf system heap design inherits this ION behavior and maps its
> > pages using remap_pfn_range even though allocated pages are backed
> > by page structs.
> > Replace remap_pfn_range with vm_insert_page, following Laura's suggestion
> > in [1]. This would allow correct PSS calculation for dmabufs.
> >
> > [1] 
> > https://driverdev-devel.linuxdriverproject.narkive.com/v0fJGpaD/using-ion-memory-for-direct-io
> > [2] 
> > http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2018-October/127519.html
> > (sorry, could not find lore links for these discussions)
> >
> > Suggested-by: Laura Abbott 
> > Signed-off-by: Suren Baghdasaryan 
>
> For consistency, do we need something similar for the cma heap as well?

Good question. Let me look closer into it.

>
> thanks
> -john


Re: [PATCH v2 2/2] dma-buf: heaps: Map system heap pages as managed by linux vm

2021-02-02 Thread Suren Baghdasaryan
On Tue, Feb 2, 2021 at 5:39 PM Minchan Kim  wrote:
>
> On Tue, Feb 02, 2021 at 04:31:34PM -0800, Suren Baghdasaryan wrote:
> > Currently system heap maps its buffers with VM_PFNMAP flag using
> > remap_pfn_range. This results in such buffers not being accounted
> > for in PSS calculations because vm treats this memory as having no
> > page structs. Without page structs there are no counters representing
> > how many processes are mapping a page and therefore PSS calculation
> > is impossible.
> > Historically, ION driver used to map its buffers as VM_PFNMAP areas
> > due to memory carveouts that did not have page structs [1]. That
> > is not the case anymore and it seems there was desire to move away
> > from remap_pfn_range [2].
> > Dmabuf system heap design inherits this ION behavior and maps its
> > pages using remap_pfn_range even though allocated pages are backed
> > by page structs.
> > Replace remap_pfn_range with vm_insert_page, following Laura's suggestion
> > in [1]. This would allow correct PSS calculation for dmabufs.
> >
> > [1] 
> > https://driverdev-devel.linuxdriverproject.narkive.com/v0fJGpaD/using-ion-memory-for-direct-io
> > [2] 
> > http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2018-October/127519.html
> > (sorry, could not find lore links for these discussions)
> >
> > Suggested-by: Laura Abbott 
> > Signed-off-by: Suren Baghdasaryan 
> Reviewed-by: Minchan Kim 
>
> A note: This patch makes the dmabuf system heap accounted as PSS, so
> if someone relies on the size, they will see the bloat.
> IIRC, there was some debate whether PSS accounting for their
> buffer is correct or not. If it'd be a problem, we need to
> discuss how to solve it (maybe via vma->vm_flags, and reintroduce
> remap_pfn_range so they are respected).

I did not see debates about not including *mapped* dmabufs into PSS
calculation. I remember people were discussing how to account dmabufs
referred only by the FD but that is a different discussion. If the
buffer is mapped into the address space of a process then IMHO
including it into PSS of that process is not controversial.


Re: [PATCH 1/2] mm: replace BUG_ON in vm_insert_page with a return of an error

2021-02-02 Thread Suren Baghdasaryan
On Tue, Feb 2, 2021 at 5:31 PM Minchan Kim  wrote:
>
> On Tue, Feb 02, 2021 at 04:31:33PM -0800, Suren Baghdasaryan wrote:
> > Replace BUG_ON(vma->vm_flags & VM_PFNMAP) in vm_insert_page with
> > WARN_ON_ONCE and returning an error. This is to ensure users of the
> > vm_insert_page that set VM_PFNMAP are notified of the wrong flag usage
> > and get an indication of an error without panicing the kernel.
> > This will help identifying drivers that need to clear VM_PFNMAP before
> > using dmabuf system heap which is moving to use vm_insert_page.
> >
> > Suggested-by: Christoph Hellwig 
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  mm/memory.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index feff48e1465a..e503c9801cd9 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1827,7 +1827,8 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
> >   return -EINVAL;
> >   if (!(vma->vm_flags & VM_MIXEDMAP)) {
> >   BUG_ON(mmap_read_trylock(vma->vm_mm));
>
> Better to replace above BUG_ON with WARN_ON_ONCE, too?

If nobody objects I'll do that in the next respin. Thanks!
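
Something along these lines, I think (sketch, untested):

	if (!(vma->vm_flags & VM_MIXEDMAP)) {
		/* trylock succeeding means the caller did NOT hold mmap_lock */
		if (WARN_ON_ONCE(mmap_read_trylock(vma->vm_mm))) {
			mmap_read_unlock(vma->vm_mm);
			return -EINVAL;
		}
		if (WARN_ON_ONCE(vma->vm_flags & VM_PFNMAP))
			return -EINVAL;
		vma->vm_flags |= VM_MIXEDMAP;
	}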



[PATCH v2 2/2] dma-buf: heaps: Map system heap pages as managed by linux vm

2021-02-02 Thread Suren Baghdasaryan
Currently system heap maps its buffers with VM_PFNMAP flag using
remap_pfn_range. This results in such buffers not being accounted
for in PSS calculations because vm treats this memory as having no
page structs. Without page structs there are no counters representing
how many processes are mapping a page and therefore PSS calculation
is impossible.
Historically, ION driver used to map its buffers as VM_PFNMAP areas
due to memory carveouts that did not have page structs [1]. That
is not the case anymore and it seems there was desire to move away
from remap_pfn_range [2].
Dmabuf system heap design inherits this ION behavior and maps its
pages using remap_pfn_range even though allocated pages are backed
by page structs.
Replace remap_pfn_range with vm_insert_page, following Laura's suggestion
in [1]. This would allow correct PSS calculation for dmabufs.

[1] 
https://driverdev-devel.linuxdriverproject.narkive.com/v0fJGpaD/using-ion-memory-for-direct-io
[2] 
http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2018-October/127519.html
(sorry, could not find lore links for these discussions)

Suggested-by: Laura Abbott 
Signed-off-by: Suren Baghdasaryan 
---
v1 posted at: https://lore.kernel.org/patchwork/patch/1372409/

changes in v2:
- removed VM_PFNMAP clearing part of the patch, per Minchan and Christoph
- created prerequisite patch to replace BUG_ON with WARN_ON_ONCE, per Christoph

 drivers/dma-buf/heaps/system_heap.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
index 17e0e9a68baf..4983f18cc2ce 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -203,8 +203,7 @@ static int system_heap_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma)
for_each_sgtable_page(table, &piter, vma->vm_pgoff) {
struct page *page = sg_page_iter_page(&piter);
 
-   ret = remap_pfn_range(vma, addr, page_to_pfn(page), PAGE_SIZE,
- vma->vm_page_prot);
+   ret = vm_insert_page(vma, addr, page);
if (ret)
return ret;
addr += PAGE_SIZE;
-- 
2.30.0.365.g02bc693789-goog



[PATCH 1/2] mm: replace BUG_ON in vm_insert_page with a return of an error

2021-02-02 Thread Suren Baghdasaryan
Replace BUG_ON(vma->vm_flags & VM_PFNMAP) in vm_insert_page with
WARN_ON_ONCE and returning an error. This is to ensure users of the
vm_insert_page that set VM_PFNMAP are notified of the wrong flag usage
and get an indication of an error without panicing the kernel.
This will help identifying drivers that need to clear VM_PFNMAP before
using dmabuf system heap which is moving to use vm_insert_page.

Suggested-by: Christoph Hellwig 
Signed-off-by: Suren Baghdasaryan 
---
 mm/memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index feff48e1465a..e503c9801cd9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1827,7 +1827,8 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
return -EINVAL;
if (!(vma->vm_flags & VM_MIXEDMAP)) {
BUG_ON(mmap_read_trylock(vma->vm_mm));
-   BUG_ON(vma->vm_flags & VM_PFNMAP);
+   if (WARN_ON_ONCE(vma->vm_flags & VM_PFNMAP))
+   return -EINVAL;
vma->vm_flags |= VM_MIXEDMAP;
}
return insert_page(vma, addr, page, vma->vm_page_prot);
-- 
2.30.0.365.g02bc693789-goog



Re: [PATCH v3 1/1] process_madvise.2: Add process_madvise man page

2021-02-02 Thread Suren Baghdasaryan
Hi Michael,

On Tue, Feb 2, 2021 at 2:45 AM Michael Kerrisk (man-pages)
 wrote:
>
> Hello Suren (and Minchan and Michal)
>
> Thank you for the revisions!
>
> I've applied this patch, and done a few light edits.

Thanks!

>
> However, I have a questions about undocumented pieces in *madvise(2)*,
> as well as one other question. See below.
>
> On 2/2/21 6:30 AM, Suren Baghdasaryan wrote:
> > Initial version of process_madvise(2) manual page. Initial text was
> > extracted from [1], amended after fix [2] and more details added using
> > man pages of madvise(2) and process_vm_read(2) as examples. It also
> > includes the changes to required permission proposed in [3].
> >
> > [1] https://lore.kernel.org/patchwork/patch/1297933/
> > [2] https://lkml.org/lkml/2020/12/8/1282
> > [3] 
> > https://patchwork.kernel.org/project/selinux/patch/20210111170622.2613577-1-sur...@google.com/#23888311
> >
> > Signed-off-by: Suren Baghdasaryan 
> > Reviewed-by: Michal Hocko 
> > ---
> > changes in v2:
> > - Changed description of MADV_COLD per Michal Hocko's suggestion
> > - Applied fixes suggested by Michael Kerrisk
> > changes in v3:
> > - Added Michal's Reviewed-by
> > - Applied additional fixes suggested by Michael Kerrisk
> >
> > NAME
> > process_madvise - give advice about use of memory to a process
> >
> > SYNOPSIS
> > #include 
> >
> > ssize_t process_madvise(int pidfd,
> >const struct iovec *iovec,
> >unsigned long vlen,
> >int advice,
> >unsigned int flags);
> >
> > DESCRIPTION
> > The process_madvise() system call is used to give advice or directions
> > to the kernel about the address ranges of another process or the calling
> > process. It provides the advice to the address ranges described by iovec
> > and vlen. The goal of such advice is to improve system or application
> > performance.
> >
> > The pidfd argument is a PID file descriptor (see pidfd_open(2)) that
> > specifies the process to which the advice is to be applied.
> >
> > The pointer iovec points to an array of iovec structures, defined in
> >  as:
> >
> > struct iovec {
> > void  *iov_base;/* Starting address */
> > size_t iov_len; /* Number of bytes to transfer */
> > };
> >
> > The iovec structure describes address ranges beginning at iov_base 
> > address
> > and with the size of iov_len bytes.
> >
> > The vlen represents the number of elements in the iovec structure.
> >
> > The advice argument is one of the values listed below.
> >
> >   Linux-specific advice values
> > The following Linux-specific advice values have no counterparts in the
> > POSIX-specified posix_madvise(3), and may or may not have counterparts
> > in the madvise(2) interface available on other implementations.
> >
> > MADV_COLD (since Linux 5.4.1)
>
> I just noticed these version numbers now, and thought: they can't be
> right (because the system call appeared only in v5.11). So I removed
> them. But, of course in another sense the version numbers are (nearly)
> right, since these advice values were added for madvise(2) in Linux 5.4.
> However, they are not documented in the madvise(2) manual page. Is it
> correct to assume that MADV_COLD and MADV_PAGEOUT have exactly the same
> meaning in madvise(2) (but just for the calling process, of course)?

Correct. They should be added in the madvise(2) man page as well IMHO.

>
> > Deactive a given range of pages which will make them a more probable
>
> I changed: s/Deactive/Deactivate/

thanks!

>
> > reclaim target should there be a memory pressure. This is a
> > nondestructive operation. The advice might be ignored for some pages
> > in the range when it is not applicable.
> >
> > MADV_PAGEOUT (since Linux 5.4.1)
> > Reclaim a given range of pages. This is done to free up memory 
> > occupied
> > by these pages. If a page is anonymous it will be swapped out. If a
> > page is file-backed and dirty it will be written back to the backing
> > storage. The advice might be ignored for some pages in the range 
> > when
> > it is not applicable.
>
> [...]
>
> > The hint might be applied to a part of iovec if one of its elements 
> > points
> > to an invalid memory region in the remote 

Re: [PATCH 1/1] dma-buf: heaps: Map system heap pages as managed by linux vm

2021-02-02 Thread Suren Baghdasaryan
On Tue, Feb 2, 2021 at 12:51 AM Christoph Hellwig  wrote:
>
> On Tue, Feb 02, 2021 at 12:44:44AM -0800, Suren Baghdasaryan wrote:
> > On Mon, Feb 1, 2021 at 11:03 PM Christoph Hellwig  
> > wrote:
> > >
> > > IMHO the
> > >
> > > BUG_ON(vma->vm_flags & VM_PFNMAP);
> > >
> > > in vm_insert_page should just become a WARN_ON_ONCE with an error
> > > return, and then we just need to gradually fix up the callers that
> > > trigger it instead of coming up with workarounds like this.
> >
> > For the existing vm_insert_page users this should be fine since
> > BUG_ON() guarantees that none of them sets VM_PFNMAP.
>
> Even for them WARN_ON_ONCE plus an actual error return is a way
> better assert that is much developer friendly.

Agree.

>
> > However, for the
> > system_heap_mmap I have one concern. When vm_insert_page returns an
> > error due to VM_PFNMAP flag, the whole mmap operation should fail
> > (system_heap_mmap returning an error leading to dma_buf_mmap failure).
> > Could there be cases when a heap user (DRM driver for example) would
> > be expected to work with a heap which requires VM_PFNMAP and at the
> > same time with another heap which requires !VM_PFNMAP? IOW, this
> > introduces a dependency between the heap and its
> > user. The user would have to know expectations of the heap it uses and
> > can't work with another heap that has the opposite expectation. This
> > usecase is purely theoretical and maybe I should not worry about it
> > for now?
>
> If such a case ever arises we can look into it.

Sounds good. I'll prepare a new patch and will post it later today. Thanks!


Re: [PATCH 1/1] dma-buf: heaps: Map system heap pages as managed by linux vm

2021-02-02 Thread Suren Baghdasaryan
On Mon, Feb 1, 2021 at 11:03 PM Christoph Hellwig  wrote:
>
> IMHO the
>
> BUG_ON(vma->vm_flags & VM_PFNMAP);
>
> in vm_insert_page should just become a WARN_ON_ONCE with an error
> return, and then we just need to gradually fix up the callers that
> trigger it instead of coming up with workarounds like this.

For the existing vm_insert_page users this should be fine since
BUG_ON() guarantees that none of them sets VM_PFNMAP. However, for the
system_heap_mmap I have one concern. When vm_insert_page returns an
error due to VM_PFNMAP flag, the whole mmap operation should fail
(system_heap_mmap returning an error leading to dma_buf_mmap failure).
Could there be cases when a heap user (DRM driver for example) would
be expected to work with a heap which requires VM_PFNMAP and at the
same time with another heap which requires !VM_PFNMAP? IOW, this
introduces a dependency between the heap and its
user. The user would have to know expectations of the heap it uses and
can't work with another heap that has the opposite expectation. This
usecase is purely theoretical and maybe I should not worry about it
for now?


Re: [PATCH v2 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-02-01 Thread Suren Baghdasaryan
On Thu, Jan 28, 2021 at 11:08 PM Suren Baghdasaryan  wrote:
>
> On Thu, Jan 28, 2021 at 11:51 AM Suren Baghdasaryan  wrote:
> >
> > On Tue, Jan 26, 2021 at 5:52 AM 'Michal Hocko' via kernel-team
> >  wrote:
> > >
> > > On Wed 20-01-21 14:17:39, Jann Horn wrote:
> > > > On Wed, Jan 13, 2021 at 3:22 PM Michal Hocko  wrote:
> > > > > On Tue 12-01-21 09:51:24, Suren Baghdasaryan wrote:
> > > > > > On Tue, Jan 12, 2021 at 9:45 AM Oleg Nesterov  
> > > > > > wrote:
> > > > > > >
> > > > > > > On 01/12, Michal Hocko wrote:
> > > > > > > >
> > > > > > > > On Mon 11-01-21 09:06:22, Suren Baghdasaryan wrote:
> > > > > > > >
> > > > > > > > > What we want is the ability for one process to influence 
> > > > > > > > > another process
> > > > > > > > > in order to optimize performance across the entire system 
> > > > > > > > > while leaving
> > > > > > > > > the security boundary intact.
> > > > > > > > > Replace PTRACE_MODE_ATTACH with a combination of 
> > > > > > > > > PTRACE_MODE_READ
> > > > > > > > > and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR 
> > > > > > > > > metadata
> > > > > > > > > and CAP_SYS_NICE for influencing process performance.
> > > > > > > >
> > > > > > > > I have to say that ptrace modes are rather obscure to me. So I 
> > > > > > > > cannot
> > > > > > > > really judge whether MODE_READ is sufficient. My understanding 
> > > > > > > > has
> > > > > > > > always been that this is requred to RO access to the address 
> > > > > > > > space. But
> > > > > > > > this operation clearly has a visible side effect. Do we have 
> > > > > > > > any actual
> > > > > > > > documentation for the existing modes?
> > > > > > > >
> > > > > > > > I would be really curious to hear from Jann and Oleg (now Cced).
> > > > > > >
> > > > > > > Can't comment, sorry. I never understood these security checks 
> > > > > > > and never tried.
> > > > > > > IIUC only selinux/etc can treat ATTACH/READ differently and I 
> > > > > > > have no idea what
> > > > > > > is the difference.
> > > >
> > > > Yama in particular only does its checks on ATTACH and ignores READ,
> > > > that's the difference you're probably most likely to encounter on a
> > > > normal desktop system, since some distros turn Yama on by default.
> > > > Basically the idea there is that running "gdb -p $pid" or "strace -p
> > > > $pid" as a normal user will usually fail, but reading /proc/$pid/maps
> > > > still works; so you can see things like detailed memory usage
> > > > information and such, but you're not supposed to be able to directly
> > > > peek into a running SSH client and inject data into the existing SSH
> > > > connection, or steal the cryptographic keys for the current
> > > > connection, or something like that.
> > > >
> > > > > > I haven't seen a written explanation on ptrace modes but when I
> > > > > > consulted Jann his explanation was:
> > > > > >
> > > > > > PTRACE_MODE_READ means you can inspect metadata about processes with
> > > > > > the specified domain, across UID boundaries.
> > > > > > PTRACE_MODE_ATTACH means you can fully impersonate processes with 
> > > > > > the
> > > > > > specified domain, across UID boundaries.
> > > > >
> > > > > Maybe this would be a good start to document expectations. Some more
> > > > > practical examples where the difference is visible would be great as
> > > > > well.
> > > >
> > > > Before documenting the behavior, it would be a good idea to figure out
> > > > what to do with perf_event_open(). That one's weird in that it only
> > > > requires PTRACE_MODE_READ, but actually allows you to sample stuff
> > > > like userspace stack and register contents (if perf_event_par
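
For reference, a sketch of the check being proposed in
do_process_madvise(); the surrounding error paths are my reconstruction,
not quoted from the patch:

	/* PTRACE_MODE_ATTACH is relaxed to a READ-mode check... */
	mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
	if (IS_ERR_OR_NULL(mm)) {
		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
		goto release_task;
	}

	/* ...plus CAP_SYS_NICE to cover the performance side effect. */
	if (!capable(CAP_SYS_NICE)) {
		ret = -EPERM;
		goto release_mm;
	}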

[PATCH v3 1/1] process_madvise.2: Add process_madvise man page

2021-02-01 Thread Suren Baghdasaryan
Initial version of process_madvise(2) manual page. Initial text was
extracted from [1], amended after fix [2] and more details added using
man pages of madvise(2) and process_vm_read(2) as examples. It also
includes the changes to required permission proposed in [3].

[1] https://lore.kernel.org/patchwork/patch/1297933/
[2] https://lkml.org/lkml/2020/12/8/1282
[3] 
https://patchwork.kernel.org/project/selinux/patch/20210111170622.2613577-1-sur...@google.com/#23888311

Signed-off-by: Suren Baghdasaryan 
Reviewed-by: Michal Hocko 
---
changes in v2:
- Changed description of MADV_COLD per Michal Hocko's suggestion
- Applied fixes suggested by Michael Kerrisk
changes in v3:
- Added Michal's Reviewed-by
- Applied additional fixes suggested by Michael Kerrisk

NAME
process_madvise - give advice about use of memory to a process

SYNOPSIS
#include <sys/uio.h>

ssize_t process_madvise(int pidfd,
   const struct iovec *iovec,
   unsigned long vlen,
   int advice,
   unsigned int flags);

DESCRIPTION
The process_madvise() system call is used to give advice or directions
to the kernel about the address ranges of another process or the calling
process. It provides the advice to the address ranges described by iovec
and vlen. The goal of such advice is to improve system or application
performance.

The pidfd argument is a PID file descriptor (see pidfd_open(2)) that
specifies the process to which the advice is to be applied.

The pointer iovec points to an array of iovec structures, defined in
<sys/uio.h> as:

struct iovec {
void  *iov_base;/* Starting address */
size_t iov_len; /* Number of bytes to transfer */
};

Each iovec structure describes an address range beginning at the iov_base
address and spanning iov_len bytes.

The vlen argument specifies the number of elements in the iovec array.

The advice argument is one of the values listed below.

  Linux-specific advice values
The following Linux-specific advice values have no counterparts in the
POSIX-specified posix_madvise(3), and may or may not have counterparts
in the madvise(2) interface available on other implementations.

MADV_COLD (since Linux 5.4.1)
Deactivate a given range of pages, which will make them a more probable
reclaim target should there be memory pressure. This is a
nondestructive operation. The advice might be ignored for some pages
in the range when it is not applicable.

MADV_PAGEOUT (since Linux 5.4.1)
Reclaim a given range of pages. This is done to free up memory occupied
by these pages. If a page is anonymous it will be swapped out. If a
page is file-backed and dirty it will be written back to the backing
storage. The advice might be ignored for some pages in the range when
it is not applicable.

The flags argument is reserved for future use; currently, this argument
must be specified as 0.

The value specified in the vlen argument must be less than or equal to
IOV_MAX (defined in <limits.h> or accessible via the call
sysconf(_SC_IOV_MAX)).

The vlen and iovec arguments are checked before applying any hints. If
the vlen is too big, or iovec is invalid, an error will be returned
immediately and no advice will be applied.

The hint might be applied to a part of iovec if one of its elements points
to an invalid memory region in the remote process. No further elements will
be processed beyond that point.

Permission to provide a hint to another process is governed by a ptrace
access mode PTRACE_MODE_READ_REALCREDS check (see ptrace(2)); in addition,
the caller must have the CAP_SYS_ADMIN capability due to performance
implications of applying the hint.

RETURN VALUE
On success, process_madvise() returns the number of bytes advised. This
return value may be less than the total number of requested bytes, if an
error occurred after some iovec elements were already processed. The caller
should check the return value to determine whether a partial advice
occurred.

On error, -1 is returned and errno is set to indicate the error.

ERRORS
EBADF pidfd is not a valid PID file descriptor.
EFAULT The memory described by iovec is outside the accessible address
   space of the process referred to by pidfd.
EINVAL flags is not 0.
EINVAL The sum of the iov_len values of iovec overflows a ssize_t value.
EINVAL vlen is too large.
ENOMEM Could not allocate memory for internal copies of the iovec
   structures.
EPERM The caller does not have permission to access the address space of
  the process pidfd.
ESRCH The target process does not exist (i.e., it has terminated and been
  waited on).

VERSIONS
This system call first appeared in Linux 5.10. Support for this system
call is optional, depending on the setting of the CONFIG_ADVISE_SYSCALLS
configuration option.
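
A minimal usage sketch (not part of the manual page): glibc provided no
wrapper at the time, so the call goes through syscall(2). This assumes
headers new enough to define SYS_pidfd_open and SYS_process_madvise; the
target PID and address range below are placeholders.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLD
#define MADV_COLD 20	/* value from <asm-generic/mman-common.h> */
#endif

int main(void)
{
	pid_t target = 1234;	/* placeholder PID */
	struct iovec iov = {
		.iov_base = (void *)0x7f0000000000,	/* placeholder range */
		.iov_len  = 4096,
	};
	ssize_t n;

	int pidfd = syscall(SYS_pidfd_open, target, 0);
	if (pidfd < 0) {
		perror("pidfd_open");
		return 1;
	}

	n = syscall(SYS_process_madvise, pidfd, &iov, 1, MADV_COLD, 0);
	if (n < 0) {
		perror("process_madvise");
		return 1;
	}
	printf("advised %zd bytes\n", n);
	return 0;
}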

Re: [PATCH v2 1/1] process_madvise.2: Add process_madvise man page

2021-02-01 Thread Suren Baghdasaryan
On Sat, Jan 30, 2021 at 1:34 PM Michael Kerrisk (man-pages)
 wrote:
>
> Hello Suren,
>
> Thank you for the revisions! Just a few more comments: all pretty small
> stuff (many points that I overlooked the first time round), since the
> page already looks pretty good by now.
>
> Again, thanks for the rendered version. As before, I've added my
> comments to the page source.

Hi Michael,
Thanks for reviewing!

>
> On 1/29/21 8:03 AM, Suren Baghdasaryan wrote:
> > Initial version of process_madvise(2) manual page. Initial text was
> > extracted from [1], amended after fix [2] and more details added using
> > man pages of madvise(2) and process_vm_read(2) as examples. It also
> > includes the changes to required permission proposed in [3].
> >
> > [1] https://lore.kernel.org/patchwork/patch/1297933/
> > [2] https://lkml.org/lkml/2020/12/8/1282
> > [3] 
> > https://patchwork.kernel.org/project/selinux/patch/20210111170622.2613577-1-sur...@google.com/#23888311
> >
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> > changes in v2:
> > - Changed description of MADV_COLD per Michal Hocko's suggestion
> > - Applied fixes suggested by Michael Kerrisk
> >
> > NAME
> > process_madvise - give advice about use of memory to a process
>
> s/-/\-/

ack

>
> >
> > SYNOPSIS
> > #include <sys/uio.h>
> >
> > ssize_t process_madvise(int pidfd,
> >const struct iovec *iovec,
> >unsigned long vlen,
> >int advice,
> >unsigned int flags);
> >
> > DESCRIPTION
> > The process_madvise() system call is used to give advice or directions
> > to the kernel about the address ranges of other process as well as of
> > the calling process. It provides the advice to address ranges of process
> > described by iovec and vlen. The goal of such advice is to improve 
> > system
> > or application performance.
> >
> > The pidfd argument is a PID file descriptor (see pidfd_open(2)) that
> > specifies the process to which the advice is to be applied.
> >
> > The pointer iovec points to an array of iovec structures, defined in
> > <sys/uio.h> as:
> >
> > struct iovec {
> > void  *iov_base;/* Starting address */
> > size_t iov_len; /* Number of bytes to transfer */
> > };
> >
> > The iovec structure describes address ranges beginning at iov_base 
> > address
> > and with the size of iov_len bytes.
> >
> > The vlen represents the number of elements in the iovec structure.
> >
> > The advice argument is one of the values listed below.
> >
> >   Linux-specific advice values
> > The following Linux-specific advice values have no counterparts in the
> > POSIX-specified posix_madvise(3), and may or may not have counterparts
> > in the madvise(2) interface available on other implementations.
> >
> > MADV_COLD (since Linux 5.4.1)
> > Deactivate a given range of pages which will make them a more probable
> > reclaim target should there be memory pressure. This is a non-
> > destructive operation. The advice might be ignored for some pages in
> > the range when it is not applicable.
> >
> > MADV_PAGEOUT (since Linux 5.4.1)
> > Reclaim a given range of pages. This is done to free up memory 
> > occupied
> > by these pages. If a page is anonymous it will be swapped out. If a
> > page is file-backed and dirty it will be written back to the backing
> > storage. The advice might be ignored for some pages in the range 
> > when
> > it is not applicable.
> >
> > The flags argument is reserved for future use; currently, this argument
> > must be specified as 0.
> >
> > The value specified in the vlen argument must be less than or equal to
> > IOV_MAX (defined in <limits.h> or accessible via the call
> > sysconf(_SC_IOV_MAX)).
> >
> > The vlen and iovec arguments are checked before applying any hints. If
> > the vlen is too big, or iovec is invalid, an error will be returned
> > immediately.
> >
> > The hint might be applied to a part of iovec if one of its elements 
> > points
> > to an invalid memory region in the remote process. No further elements 
> > will
> > be processed beyond that point.
> >
> > Permission to provide a hint to another process is governed by a ptra

Re: [PATCH 1/1] dma-buf: heaps: Map system heap pages as managed by linux vm

2021-02-01 Thread Suren Baghdasaryan
On Thu, Jan 28, 2021 at 11:00 AM Suren Baghdasaryan  wrote:
>
> On Thu, Jan 28, 2021 at 10:19 AM Minchan Kim  wrote:
> >
> > On Thu, Jan 28, 2021 at 09:52:59AM -0800, Suren Baghdasaryan wrote:
> > > On Thu, Jan 28, 2021 at 1:13 AM Christoph Hellwig  
> > > wrote:
> > > >
> > > > On Thu, Jan 28, 2021 at 12:38:17AM -0800, Suren Baghdasaryan wrote:
> > > > > Currently system heap maps its buffers with VM_PFNMAP flag using
> > > > > remap_pfn_range. This results in such buffers not being accounted
> > > > > for in PSS calculations because vm treats this memory as having no
> > > > > page structs. Without page structs there are no counters representing
> > > > > how many processes are mapping a page and therefore PSS calculation
> > > > > is impossible.
> > > > > Historically, ION driver used to map its buffers as VM_PFNMAP areas
> > > > > due to memory carveouts that did not have page structs [1]. That
> > > > > is not the case anymore and it seems there was desire to move away
> > > > > from remap_pfn_range [2].
> > > > > Dmabuf system heap design inherits this ION behavior and maps its
> > > > > pages using remap_pfn_range even though allocated pages are backed
> > > > > by page structs.
> > > > > Clear VM_IO and VM_PFNMAP flags when mapping memory allocated by the
> > > > > system heap and replace remap_pfn_range with vm_insert_page, following
> > > > > Laura's suggestion in [1]. This would allow correct PSS calculation
> > > > > for dmabufs.
> > > > >
> > > > > [1] 
> > > > > https://driverdev-devel.linuxdriverproject.narkive.com/v0fJGpaD/using-ion-memory-for-direct-io
> > > > > [2] 
> > > > > http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2018-October/127519.html
> > > > > (sorry, could not find lore links for these discussions)
> > > > >
> > > > > Suggested-by: Laura Abbott 
> > > > > Signed-off-by: Suren Baghdasaryan 
> > > > > ---
> > > > >  drivers/dma-buf/heaps/system_heap.c | 6 --
> > > > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/drivers/dma-buf/heaps/system_heap.c 
> > > > > b/drivers/dma-buf/heaps/system_heap.c
> > > > > index 17e0e9a68baf..0e92e42b2251 100644
> > > > > --- a/drivers/dma-buf/heaps/system_heap.c
> > > > > +++ b/drivers/dma-buf/heaps/system_heap.c
> > > > > @@ -200,11 +200,13 @@ static int system_heap_mmap(struct dma_buf 
> > > > > *dmabuf, struct vm_area_struct *vma)
> > > > >   struct sg_page_iter piter;
> > > > >   int ret;
> > > > >
> > > > > + /* All pages are backed by a "struct page" */
> > > > > + vma->vm_flags &= ~VM_PFNMAP;
> > > >
> > > > Why do we clear this flag?  It shouldn't even be set here as far as I
> > > > can tell.
> > >
> > > Thanks for the question, Christoph.
> > > I tracked down that flag being set by drm_gem_mmap_obj() which DRM
> > > drivers use to "Set up the VMA to prepare mapping of the GEM object"
> > > (according to drm_gem_mmap_obj comments). I also see a pattern in
> > > several DRM drivers to call drm_gem_mmap_obj()/drm_gem_mmap(), then
> > > clear VM_PFNMAP and then map the VMA (for example here:
> > > https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/rockchip/rockchip_drm_gem.c#L246).
> > > I thought that dmabuf allocator (in this case the system heap) would
> > > be the right place to set these flags because it controls how memory
> > > is allocated before mapping. However it's quite possible that I'm
> >
> > However, you're not setting but removing a flag under the caller.
> > It's different with appending more flags(e.g., removing condition
> > vs adding more conditions). If we should remove the flag, caller
> > didn't need to set it from the beginning. Hiding it under this API
> > continue to make wrong usecase in future.
>
> Which takes us back to the question of why VM_PFNMAP is being set by
> the caller in the first place.
>
> >
> > > missing the real reason for VM_PFNMAP being set in drm_gem_mmap_obj()
> > > before dma_buf_mmap() is called. I could not find the answer to that,
> > > so I hope someone here can clarify that.
> &

Re: [PATCH v2 1/1] process_madvise.2: Add process_madvise man page

2021-01-29 Thread Suren Baghdasaryan
On Fri, Jan 29, 2021 at 1:13 AM 'Michal Hocko' via kernel-team
 wrote:
>
> On Thu 28-01-21 23:03:40, Suren Baghdasaryan wrote:
> > Initial version of process_madvise(2) manual page. Initial text was
> > extracted from [1], amended after fix [2] and more details added using
> > man pages of madvise(2) and process_vm_read(2) as examples. It also
> > includes the changes to required permission proposed in [3].
> >
> > [1] https://lore.kernel.org/patchwork/patch/1297933/
> > [2] https://lkml.org/lkml/2020/12/8/1282
> > [3] 
> > https://patchwork.kernel.org/project/selinux/patch/20210111170622.2613577-1-sur...@google.com/#23888311
> >
> > Signed-off-by: Suren Baghdasaryan 
>
> Reviewed-by: Michal Hocko 

Thanks!

> Thanks!
>
> > ---
> > changes in v2:
> > - Changed description of MADV_COLD per Michal Hocko's suggestion
> > - Applied fixes suggested by Michael Kerrisk
> >
> > NAME
> > process_madvise - give advice about use of memory to a process
> >
> > SYNOPSIS
> > #include <sys/uio.h>
> >
> > ssize_t process_madvise(int pidfd,
> >const struct iovec *iovec,
> >unsigned long vlen,
> >int advice,
> >unsigned int flags);
> >
> > DESCRIPTION
> > The process_madvise() system call is used to give advice or directions
> > to the kernel about the address ranges of other process as well as of
> > the calling process. It provides the advice to address ranges of process
> > described by iovec and vlen. The goal of such advice is to improve 
> > system
> > or application performance.
> >
> > The pidfd argument is a PID file descriptor (see pidfd_open(2)) that
> > specifies the process to which the advice is to be applied.
> >
> > The pointer iovec points to an array of iovec structures, defined in
> > <sys/uio.h> as:
> >
> > struct iovec {
> > void  *iov_base;/* Starting address */
> > size_t iov_len; /* Number of bytes to transfer */
> > };
> >
> > The iovec structure describes address ranges beginning at iov_base 
> > address
> > and with the size of iov_len bytes.
> >
> > The vlen represents the number of elements in the iovec structure.
> >
> > The advice argument is one of the values listed below.
> >
> >   Linux-specific advice values
> > The following Linux-specific advice values have no counterparts in the
> > POSIX-specified posix_madvise(3), and may or may not have counterparts
> > in the madvise(2) interface available on other implementations.
> >
> > MADV_COLD (since Linux 5.4.1)
> > Deactivate a given range of pages which will make them a more probable
> > reclaim target should there be memory pressure. This is a non-
> > destructive operation. The advice might be ignored for some pages in
> > the range when it is not applicable.
> >
> > MADV_PAGEOUT (since Linux 5.4.1)
> > Reclaim a given range of pages. This is done to free up memory 
> > occupied
> > by these pages. If a page is anonymous it will be swapped out. If a
> > page is file-backed and dirty it will be written back to the backing
> > storage. The advice might be ignored for some pages in the range 
> > when
> > it is not applicable.
> >
> > The flags argument is reserved for future use; currently, this argument
> > must be specified as 0.
> >
> > The value specified in the vlen argument must be less than or equal to
> > IOV_MAX (defined in <limits.h> or accessible via the call
> > sysconf(_SC_IOV_MAX)).
> >
> > The vlen and iovec arguments are checked before applying any hints. If
> > the vlen is too big, or iovec is invalid, an error will be returned
> > immediately.
> >
> > The hint might be applied to a part of iovec if one of its elements 
> > points
> > to an invalid memory region in the remote process. No further elements 
> > will
> > be processed beyond that point.
> >
> > Permission to provide a hint to another process is governed by a ptrace
> > access mode PTRACE_MODE_READ_REALCREDS check (see ptrace(2)); in 
> > addition,
> > the caller must have the CAP_SYS_ADMIN capability due to performance
> > implications of applying the hint.
> >
> > RETURN VALUE
> > On success, process_madvise() returns the

Re: [PATCH 1/1] process_madvise.2: Add process_madvise man page

2021-01-28 Thread Suren Baghdasaryan
On Thu, Jan 28, 2021 at 12:31 PM Michael Kerrisk (man-pages)
 wrote:
>
> Hello Suren,
>
> On 1/28/21 7:40 PM, Suren Baghdasaryan wrote:
> > On Thu, Jan 28, 2021 at 4:24 AM Michael Kerrisk (man-pages)
> >  wrote:
> >>
> >> Hello Suren,
> >>
> >> Thank you for writing this page! Some comments below.
> >
> > Thanks for the review!
> > Couple questions below and I'll respin the new version once they are 
> > clarified.
>
> Okay. See below.
>
> >> On Wed, 20 Jan 2021 at 21:36, Suren Baghdasaryan  wrote:
> >>>
>
> [...]
>
> Thanks for all the acks. That lets me know that you saw what I said.
>
> >>> RETURN VALUE
> >>> On success, process_madvise() returns the number of bytes advised. 
> >>> This
> >>> return value may be less than the total number of requested bytes, if 
> >>> an
> >>> error occurred. The caller should check return value to determine 
> >>> whether
> >>> a partial advice occurred.
> >>
> >> So there are three return values possible,
> >
> > Ok, I think I see your point. How about this instead:
>
> Well, I'm glad you saw it, because I forgot to finish it. But yes,
> you understood what I forgot to say.
>
> > RETURN VALUE
> >  On success, process_madvise() returns the number of bytes advised. This
> >  return value may be less than the total number of requested bytes, if 
> > an
> >  error occurred after some iovec elements were already processed. The 
> > caller
> >  should check the return value to determine whether a partial
> > advice occurred.
> >
> > On error, -1 is returned and errno is set appropriately.
>
> We recently standardized some wording here:
> s/appropriately/to indicate the error/.
>
>
> >>> +.PP
> >>> +The pointer
> >>> +.I iovec
> >>> +points to an array of iovec structures, defined in
> >>
> >> "iovec" should be formatted as
> >>
> >> .I iovec
> >
> > I think it is formatted that way above. What am I missing?
>
> But also in "an array of iovec structures"...
>
> > BTW, where should I be using .I vs .IR? I was looking for an answer
> > but could not find it.
>
> .B / .I == bold/italic this line
> .BR / .IR == alternate bold/italic with normal (Roman) font.
>
> So:
> .I iovec
> .I iovec ,   # so that comma is not italic
> .BR process_madvise ()
> etc.
>
> [...]
>
> >>> +.I iovec
> >>> +if one of its elements points to an invalid memory
> >>> +region in the remote process. No further elements will be
> >>> +processed beyond that point.
> >>> +.PP
> >>> +Permission to provide a hint to external process is governed by a
> >>> +ptrace access mode
> >>> +.B PTRACE_MODE_READ_REALCREDS
> >>> +check; see
> >>> +.BR ptrace (2)
> >>> +and
> >>> +.B CAP_SYS_ADMIN
> >>> +capability that caller should have in order to affect performance
> >>> +of an external process.
> >>
> >> The preceding sentence is garbled. Missing words?
> >
> > Maybe I worded it incorrectly. What I need to say here is that the
> > caller should have both PTRACE_MODE_READ_REALCREDS credentials and
> > CAP_SYS_ADMIN capability. The first part I shamelessly copy/pasted
> > from https://man7.org/linux/man-pages/man2/process_vm_readv.2.html and
> > tried adding the second one to it, obviously unsuccessfully. Any
> > advice on how to fix that?
>
> I think you already got pretty close. How about:
>
> [[
> Permission to provide a hint to another process is governed by a
> ptrace access mode
> .B PTRACE_MODE_READ_REALCREDS
> check (see
> .BR ptrace (2));
> in addition, the caller must have the
> .B CAP_SYS_ADMIN
> capability.

In v2 I expanded this part a bit to explain why CAP_SYS_ADMIN is
needed. There were questions about that during the review of my patch
which adds this requirement
(https://lore.kernel.org/patchwork/patch/1363605), so I thought a
short explanation would be useful.

> ]]
>
> [...]
>
> >>> +.TP
> >>> +.B ESRCH
> >>> +No process with ID
> >>> +.I pidfd
> >>> +exists.
> >>
> >> Should this maybe be:
> >> [[
> >> The target process does not exist (i.e., it has terminated and
> >> been waited on).
> >> ]]
> >>
> >> See pidfd_send_signal(2).
> >
> > I "borrowed" mine from
> > https://man7.org/linux/man-pages/man2/process_vm_readv.2.html but
> > either one sounds good to me. Maybe for pidfd_send_signal the wording
> > about termination is more important. Anyway, it's up to you. Just let
> > me know which one to use.
>
> I think the pidfd_send_signal(2) wording fits better.
>
> [...]
>
> Thanks,
>
> Michael
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH v2 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-01-28 Thread Suren Baghdasaryan
On Thu, Jan 28, 2021 at 11:51 AM Suren Baghdasaryan  wrote:
>
> On Tue, Jan 26, 2021 at 5:52 AM 'Michal Hocko' via kernel-team
>  wrote:
> >
> > On Wed 20-01-21 14:17:39, Jann Horn wrote:
> > > On Wed, Jan 13, 2021 at 3:22 PM Michal Hocko  wrote:
> > > > On Tue 12-01-21 09:51:24, Suren Baghdasaryan wrote:
> > > > > On Tue, Jan 12, 2021 at 9:45 AM Oleg Nesterov  wrote:
> > > > > >
> > > > > > On 01/12, Michal Hocko wrote:
> > > > > > >
> > > > > > > On Mon 11-01-21 09:06:22, Suren Baghdasaryan wrote:
> > > > > > >
> > > > > > > > What we want is the ability for one process to influence 
> > > > > > > > another process
> > > > > > > > in order to optimize performance across the entire system while 
> > > > > > > > leaving
> > > > > > > > the security boundary intact.
> > > > > > > > Replace PTRACE_MODE_ATTACH with a combination of 
> > > > > > > > PTRACE_MODE_READ
> > > > > > > > and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR 
> > > > > > > > metadata
> > > > > > > > and CAP_SYS_NICE for influencing process performance.
> > > > > > >
> > > > > > > I have to say that ptrace modes are rather obscure to me. So I 
> > > > > > > cannot
> > > > > > > really judge whether MODE_READ is sufficient. My understanding has
> > > > > > always been that this is required for RO access to the address 
> > > > > > > space. But
> > > > > > > this operation clearly has a visible side effect. Do we have any 
> > > > > > > actual
> > > > > > > documentation for the existing modes?
> > > > > > >
> > > > > > > I would be really curious to hear from Jann and Oleg (now Cced).
> > > > > >
> > > > > > Can't comment, sorry. I never understood these security checks and 
> > > > > > never tried.
> > > > > > IIUC only selinux/etc can treat ATTACH/READ differently and I have 
> > > > > > no idea what
> > > > > > is the difference.
> > >
> > > Yama in particular only does its checks on ATTACH and ignores READ,
> > > that's the difference you're probably most likely to encounter on a
> > > normal desktop system, since some distros turn Yama on by default.
> > > Basically the idea there is that running "gdb -p $pid" or "strace -p
> > > $pid" as a normal user will usually fail, but reading /proc/$pid/maps
> > > still works; so you can see things like detailed memory usage
> > > information and such, but you're not supposed to be able to directly
> > > peek into a running SSH client and inject data into the existing SSH
> > > connection, or steal the cryptographic keys for the current
> > > connection, or something like that.
> > >
> > > > > I haven't seen a written explanation on ptrace modes but when I
> > > > > consulted Jann his explanation was:
> > > > >
> > > > > PTRACE_MODE_READ means you can inspect metadata about processes with
> > > > > the specified domain, across UID boundaries.
> > > > > PTRACE_MODE_ATTACH means you can fully impersonate processes with the
> > > > > specified domain, across UID boundaries.
> > > >
> > > > Maybe this would be a good start to document expectations. Some more
> > > > practical examples where the difference is visible would be great as
> > > > well.
> > >
> > > Before documenting the behavior, it would be a good idea to figure out
> > > what to do with perf_event_open(). That one's weird in that it only
> > > requires PTRACE_MODE_READ, but actually allows you to sample stuff
> > > like userspace stack and register contents (if perf_event_paranoid is
> > > 1 or 2). Maybe for SELinux things (and maybe also for Yama), there
> > > should be a level in between that allows fully inspecting the process
> > > (for purposes like profiling) but without the ability to corrupt its
> > > memory or registers or things like that. Or maybe perf_event_open()
> > > should just use the ATTACH mode.
> >
> > Thanks for the clarification. I still cannot say I would have a good
> > mental p

[PATCH v2 1/1] process_madvise.2: Add process_madvise man page

2021-01-28 Thread Suren Baghdasaryan
Initial version of process_madvise(2) manual page. Initial text was
extracted from [1], amended after fix [2] and more details added using
man pages of madvise(2) and process_vm_read(2) as examples. It also
includes the changes to required permission proposed in [3].

[1] https://lore.kernel.org/patchwork/patch/1297933/
[2] https://lkml.org/lkml/2020/12/8/1282
[3] 
https://patchwork.kernel.org/project/selinux/patch/20210111170622.2613577-1-sur...@google.com/#23888311

Signed-off-by: Suren Baghdasaryan 
---
changes in v2:
- Changed description of MADV_COLD per Michal Hocko's suggestion
- Applied fixes suggested by Michael Kerrisk

NAME
process_madvise - give advice about use of memory to a process

SYNOPSIS
#include <sys/uio.h>

ssize_t process_madvise(int pidfd,
   const struct iovec *iovec,
   unsigned long vlen,
   int advice,
   unsigned int flags);

DESCRIPTION
The process_madvise() system call is used to give advice or directions
to the kernel about the address ranges of another process as well as of
the calling process. It provides the advice to the address ranges
described by iovec and vlen. The goal of such advice is to improve system
or application performance.

The pidfd argument is a PID file descriptor (see pidfd_open(2)) that
specifies the process to which the advice is to be applied.

The pointer iovec points to an array of iovec structures, defined in
<sys/uio.h> as:

struct iovec {
void  *iov_base;/* Starting address */
size_t iov_len; /* Number of bytes to transfer */
};

Each iovec structure describes an address range beginning at the iov_base
address and spanning iov_len bytes.

The vlen argument specifies the number of elements in the iovec array.

The advice argument is one of the values listed below.

  Linux-specific advice values
The following Linux-specific advice values have no counterparts in the
POSIX-specified posix_madvise(3), and may or may not have counterparts
in the madvise(2) interface available on other implementations.

MADV_COLD (since Linux 5.4.1)
Deactivate a given range of pages, which will make them a more probable
reclaim target should there be memory pressure. This is a
nondestructive operation. The advice might be ignored for some pages in
the range when it is not applicable.

MADV_PAGEOUT (since Linux 5.4.1)
Reclaim a given range of pages. This is done to free up memory occupied
by these pages. If a page is anonymous it will be swapped out. If a
page is file-backed and dirty it will be written back to the backing
storage. The advice might be ignored for some pages in the range when
it is not applicable.

The flags argument is reserved for future use; currently, this argument
must be specified as 0.

The value specified in the vlen argument must be less than or equal to
IOV_MAX (defined in <limits.h> or accessible via the call
sysconf(_SC_IOV_MAX)).

The vlen and iovec arguments are checked before applying any hints. If
the vlen is too big, or iovec is invalid, an error will be returned
immediately.

The hint might be applied to a part of iovec if one of its elements points
to an invalid memory region in the remote process. No further elements will
be processed beyond that point.

Permission to provide a hint to another process is governed by a ptrace
access mode PTRACE_MODE_READ_REALCREDS check (see ptrace(2)); in addition,
the caller must have the CAP_SYS_ADMIN capability due to performance
implications of applying the hint.

RETURN VALUE
On success, process_madvise() returns the number of bytes advised. This
return value may be less than the total number of requested bytes, if an
error occurred after some iovec elements were already processed. The caller
should check the return value to determine whether a partial advice
occurred.

On error, -1 is returned and errno is set to indicate the error.

ERRORS
EFAULT The memory described by iovec is outside the accessible address
   space of the process referred to by pidfd.
EINVAL flags is not 0.
EINVAL The sum of the iov_len values of iovec overflows a ssize_t value.
EINVAL vlen is too large.
ENOMEM Could not allocate memory for internal copies of the iovec
   structures.
EPERM The caller does not have permission to access the address space of
  the process pidfd.
ESRCH The target process does not exist (i.e., it has terminated and been
  waited on).
EBADF pidfd is not a valid PID file descriptor.

VERSIONS
This system call first appeared in Linux 5.10. Support for this system
call is optional, depending on the setting of the CONFIG_ADVISE_SYSCALLS
configuration option.

SEE ALSO
madvise(2)

Re: [PATCH 1/1] process_madvise.2: Add process_madvise man page

2021-01-28 Thread Suren Baghdasaryan
On Thu, Jan 28, 2021 at 12:31 PM Michael Kerrisk (man-pages)
 wrote:
>
> Hello Suren,
>
> On 1/28/21 7:40 PM, Suren Baghdasaryan wrote:
> > On Thu, Jan 28, 2021 at 4:24 AM Michael Kerrisk (man-pages)
> >  wrote:
> >>
> >> Hello Suren,
> >>
> >> Thank you for writing this page! Some comments below.
> >
> > Thanks for the review!
> > Couple questions below and I'll respin the new version once they are 
> > clarified.
>
> Okay. See below.
>
> >> On Wed, 20 Jan 2021 at 21:36, Suren Baghdasaryan  wrote:
> >>>
>
> [...]
>
> Thanks for all the acks. That lets me know that you saw what I said.
>
> >>> RETURN VALUE
> >>> On success, process_madvise() returns the number of bytes advised. 
> >>> This
> >>> return value may be less than the total number of requested bytes, if 
> >>> an
> >>> error occurred. The caller should check return value to determine 
> >>> whether
> >>> a partial advice occurred.
> >>
> >> So there are three return values possible,
> >
> > Ok, I think I see your point. How about this instead:
>
> Well, I'm glad you saw it, because I forgot to finish it. But yes,
> you understood what I forgot to say.
>
> > RETURN VALUE
> >  On success, process_madvise() returns the number of bytes advised. This
> >  return value may be less than the total number of requested bytes, if 
> > an
> >  error occurred after some iovec elements were already processed. The 
> > caller
> >  should check the return value to determine whether a partial
> > advice occurred.
> >
> > On error, -1 is returned and errno is set appropriately.
>
> We recently standardized some wording here:
> s/appropriately/to indicate the error/.

ack

>
>
> >>> +.PP
> >>> +The pointer
> >>> +.I iovec
> >>> +points to an array of iovec structures, defined in
> >>
> >> "iovec" should be formatted as
> >>
> >> .I iovec
> >
> > I think it is formatted that way above. What am I missing?
>
> But also in "an array of iovec structures"...

ack

>
> > BTW, where should I be using .I vs .IR? I was looking for an answer
> > but could not find it.
>
> .B / .I == bold/italic this line
> .BR / .IR == alternate bold/italic with normal (Roman) font.
>
> So:
> .I iovec
> .I iovec ,   # so that comma is not italic
> .BR process_madvise ()
> etc.

Aha! Got it now. It's clear after your example. Thanks!

>
> [...]
>
> >>> +.I iovec
> >>> +if one of its elements points to an invalid memory
> >>> +region in the remote process. No further elements will be
> >>> +processed beyond that point.
> >>> +.PP
> >>> +Permission to provide a hint to external process is governed by a
> >>> +ptrace access mode
> >>> +.B PTRACE_MODE_READ_REALCREDS
> >>> +check; see
> >>> +.BR ptrace (2)
> >>> +and
> >>> +.B CAP_SYS_ADMIN
> >>> +capability that caller should have in order to affect performance
> >>> +of an external process.
> >>
> >> The preceding sentence is garbled. Missing words?
> >
> > Maybe I worded it incorrectly. What I need to say here is that the
> > caller should have both PTRACE_MODE_READ_REALCREDS credentials and
> > CAP_SYS_ADMIN capability. The first part I shamelessly copy/pasted
> > from https://man7.org/linux/man-pages/man2/process_vm_readv.2.html and
> > tried adding the second one to it, obviously unsuccessfully. Any
> > advice on how to fix that?
>
> I think you already got pretty close. How about:
>
> [[
> Permission to provide a hint to another process is governed by a
> ptrace access mode
> .B PTRACE_MODE_READ_REALCREDS
> check (see
> .BR ptrace (2));
> in addition, the caller must have the
> .B CAP_SYS_ADMIN
> capability.
> ]]

Perfect! I'll use that.

>
> [...]
>
> >>> +.TP
> >>> +.B ESRCH
> >>> +No process with ID
> >>> +.I pidfd
> >>> +exists.
> >>
> >> Should this maybe be:
> >> [[
> >> The target process does not exist (i.e., it has terminated and
> >> been waited on).
> >> ]]
> >>
> >> See pidfd_send_signal(2).
> >
> > I "borrowed" mine from
> > https://man7.org/linux/man-pages/man2/process_vm_readv.2.html but
> > either one sounds good to me. Maybe for pidfd_send_signal the wording
> > about termination is more important. Anyway, it's up to you. Just let
> > me know which one to use.
>
> I think the pidfd_send_signal(2) wording fits better.

ack, will use pidfd_send_signal(2) version.

>
> [...]
>
> Thanks,
>
> Michael

I'll include your and Michal's suggestions and will post the next
version later today or tomorrow morning.
Thanks for the guidance!

>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH v2 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-01-28 Thread Suren Baghdasaryan
On Tue, Jan 26, 2021 at 5:52 AM 'Michal Hocko' via kernel-team
 wrote:
>
> On Wed 20-01-21 14:17:39, Jann Horn wrote:
> > On Wed, Jan 13, 2021 at 3:22 PM Michal Hocko  wrote:
> > > On Tue 12-01-21 09:51:24, Suren Baghdasaryan wrote:
> > > > On Tue, Jan 12, 2021 at 9:45 AM Oleg Nesterov  wrote:
> > > > >
> > > > > On 01/12, Michal Hocko wrote:
> > > > > >
> > > > > > On Mon 11-01-21 09:06:22, Suren Baghdasaryan wrote:
> > > > > >
> > > > > > > What we want is the ability for one process to influence another 
> > > > > > > process
> > > > > > > in order to optimize performance across the entire system while 
> > > > > > > leaving
> > > > > > > the security boundary intact.
> > > > > > > Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ
> > > > > > > and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR 
> > > > > > > metadata
> > > > > > > and CAP_SYS_NICE for influencing process performance.
> > > > > >
> > > > > > I have to say that ptrace modes are rather obscure to me. So I 
> > > > > > cannot
> > > > > > really judge whether MODE_READ is sufficient. My understanding has
> > > > > > always been that this is required for RO access to the address space. 
> > > > > > But
> > > > > > this operation clearly has a visible side effect. Do we have any 
> > > > > > actual
> > > > > > documentation for the existing modes?
> > > > > >
> > > > > > I would be really curious to hear from Jann and Oleg (now Cced).
> > > > >
> > > > > Can't comment, sorry. I never understood these security checks and 
> > > > > never tried.
> > > > > IIUC only selinux/etc can treat ATTACH/READ differently and I have no 
> > > > > idea what
> > > > > is the difference.
> >
> > Yama in particular only does its checks on ATTACH and ignores READ,
> > that's the difference you're probably most likely to encounter on a
> > normal desktop system, since some distros turn Yama on by default.
> > Basically the idea there is that running "gdb -p $pid" or "strace -p
> > $pid" as a normal user will usually fail, but reading /proc/$pid/maps
> > still works; so you can see things like detailed memory usage
> > information and such, but you're not supposed to be able to directly
> > peek into a running SSH client and inject data into the existing SSH
> > connection, or steal the cryptographic keys for the current
> > connection, or something like that.
> >
> > > > I haven't seen a written explanation on ptrace modes but when I
> > > > consulted Jann his explanation was:
> > > >
> > > > PTRACE_MODE_READ means you can inspect metadata about processes with
> > > > the specified domain, across UID boundaries.
> > > > PTRACE_MODE_ATTACH means you can fully impersonate processes with the
> > > > specified domain, across UID boundaries.
> > >
> > > Maybe this would be a good start to document expectations. Some more
> > > practical examples where the difference is visible would be great as
> > > well.
> >
> > Before documenting the behavior, it would be a good idea to figure out
> > what to do with perf_event_open(). That one's weird in that it only
> > requires PTRACE_MODE_READ, but actually allows you to sample stuff
> > like userspace stack and register contents (if perf_event_paranoid is
> > 1 or 2). Maybe for SELinux things (and maybe also for Yama), there
> > should be a level in between that allows fully inspecting the process
> > (for purposes like profiling) but without the ability to corrupt its
> > memory or registers or things like that. Or maybe perf_event_open()
> > should just use the ATTACH mode.
>
> Thanks for the clarification. I still cannot say I would have a good
> mental picture. Having something in Documentation/core-api/ sounds
> really needed. Wrt to perf_event_open it sounds really odd it can do
> more than other places restrict indeed. Something for the respective
> maintainer but I strongly suspect people simply copy the pattern from
> other places because the expected semantic is not really clear.
>

Sorry, back to the matters of this patch. Are there any actionable
items for me to take care of before it can be accepted? The only
request from Andrew to write a man page is being worked on at
https://lore.kernel.org/linux-mm/20210120202337.1481402-1-sur...@google.com/
and I'll follow up with the next version. I also CC'ed stable@ for
this to be included into 5.10 per Andrew's request. That CC was lost
at some point, so CC'ing again.

I do not see anything else on this patch to fix. Please chime in if
there are any more concerns, otherwise I would ask Andrew to take it
into mm-tree and stable@ to apply it to 5.10.
Thanks!


> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kernel-team+unsubscr...@android.com.
>


Re: [PATCH 1/1] dma-buf: heaps: Map system heap pages as managed by linux vm

2021-01-28 Thread Suren Baghdasaryan
On Thu, Jan 28, 2021 at 1:13 AM Christoph Hellwig  wrote:
>
> On Thu, Jan 28, 2021 at 12:38:17AM -0800, Suren Baghdasaryan wrote:
> > Currently system heap maps its buffers with VM_PFNMAP flag using
> > remap_pfn_range. This results in such buffers not being accounted
> > for in PSS calculations because vm treats this memory as having no
> > page structs. Without page structs there are no counters representing
> > how many processes are mapping a page and therefore PSS calculation
> > is impossible.
> > Historically, ION driver used to map its buffers as VM_PFNMAP areas
> > due to memory carveouts that did not have page structs [1]. That
> > is not the case anymore and it seems there was desire to move away
> > from remap_pfn_range [2].
> > Dmabuf system heap design inherits this ION behavior and maps its
> > pages using remap_pfn_range even though allocated pages are backed
> > by page structs.
> > Clear VM_IO and VM_PFNMAP flags when mapping memory allocated by the
> > system heap and replace remap_pfn_range with vm_insert_page, following
> > Laura's suggestion in [1]. This would allow correct PSS calculation
> > for dmabufs.
> >
> > [1] 
> > https://driverdev-devel.linuxdriverproject.narkive.com/v0fJGpaD/using-ion-memory-for-direct-io
> > [2] 
> > http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2018-October/127519.html
> > (sorry, could not find lore links for these discussions)
> >
> > Suggested-by: Laura Abbott 
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  drivers/dma-buf/heaps/system_heap.c | 6 --
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/dma-buf/heaps/system_heap.c 
> > b/drivers/dma-buf/heaps/system_heap.c
> > index 17e0e9a68baf..0e92e42b2251 100644
> > --- a/drivers/dma-buf/heaps/system_heap.c
> > +++ b/drivers/dma-buf/heaps/system_heap.c
> > @@ -200,11 +200,13 @@ static int system_heap_mmap(struct dma_buf *dmabuf, 
> > struct vm_area_struct *vma)
> >   struct sg_page_iter piter;
> >   int ret;
> >
> > + /* All pages are backed by a "struct page" */
> > + vma->vm_flags &= ~VM_PFNMAP;
>
> Why do we clear this flag?  It shouldn't even be set here as far as I
> can tell.

Thanks for the question, Christoph.
I tracked down that flag being set by drm_gem_mmap_obj() which DRM
drivers use to "Set up the VMA to prepare mapping of the GEM object"
(according to drm_gem_mmap_obj comments). I also see a pattern in
several DRM drivers to call drm_gem_mmap_obj()/drm_gem_mmap(), then
clear VM_PFNMAP and then map the VMA (for example here:
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/rockchip/rockchip_drm_gem.c#L246).
I thought that dmabuf allocator (in this case the system heap) would
be the right place to set these flags because it controls how memory
is allocated before mapping. However it's quite possible that I'm
missing the real reason for VM_PFNMAP being set in drm_gem_mmap_obj()
before dma_buf_mmap() is called. I could not find the answer to that,
so I hope someone here can clarify that.
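
For reference, the driver pattern described above, reduced to a sketch
(simplified from the Rockchip example linked; not verbatim driver code):

	/* drm_gem_mmap_obj() sets VM_PFNMAP (and VM_IO) on the VMA... */
	ret = drm_gem_mmap_obj(obj, obj->size, vma);
	if (ret)
		return ret;

	/* ...which the driver clears again before inserting pages that are
	 * backed by struct pages (real drivers loop over all pages). */
	vma->vm_flags &= ~VM_PFNMAP;
	return vm_insert_page(vma, vma->vm_start, page);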


Re: [PATCH 1/1] dma-buf: heaps: Map system heap pages as managed by linux vm

2021-01-28 Thread Suren Baghdasaryan
On Thu, Jan 28, 2021 at 12:38 AM Suren Baghdasaryan  wrote:
>
> Currently system heap maps its buffers with VM_PFNMAP flag using
> remap_pfn_range. This results in such buffers not being accounted
> for in PSS calculations because vm treats this memory as having no
> page structs. Without page structs there are no counters representing
> how many processes are mapping a page and therefore PSS calculation
> is impossible.
> Historically, ION driver used to map its buffers as VM_PFNMAP areas
> due to memory carveouts that did not have page structs [1]. That
> is not the case anymore and it seems there was desire to move away
> from remap_pfn_range [2].
> Dmabuf system heap design inherits this ION behavior and maps its
> pages using remap_pfn_range even though allocated pages are backed
> by page structs.
> Clear VM_IO and VM_PFNMAP flags when mapping memory allocated by the

Argh, please ignore VM_IO in the description. The patch does not touch
that flag. I'll fix that in the next revision.

> system heap and replace remap_pfn_range with vm_insert_page, following
> Laura's suggestion in [1]. This would allow correct PSS calculation
> for dmabufs.
>
> [1] 
> https://driverdev-devel.linuxdriverproject.narkive.com/v0fJGpaD/using-ion-memory-for-direct-io
> [2] 
> http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2018-October/127519.html
> (sorry, could not find lore links for these discussions)
>
> Suggested-by: Laura Abbott 
> Signed-off-by: Suren Baghdasaryan 
> ---
>  drivers/dma-buf/heaps/system_heap.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/dma-buf/heaps/system_heap.c 
> b/drivers/dma-buf/heaps/system_heap.c
> index 17e0e9a68baf..0e92e42b2251 100644
> --- a/drivers/dma-buf/heaps/system_heap.c
> +++ b/drivers/dma-buf/heaps/system_heap.c
> @@ -200,11 +200,13 @@ static int system_heap_mmap(struct dma_buf *dmabuf, 
> struct vm_area_struct *vma)
> struct sg_page_iter piter;
> int ret;
>
> +   /* All pages are backed by a "struct page" */
> +   vma->vm_flags &= ~VM_PFNMAP;
> +
> for_each_sgtable_page(table, , vma->vm_pgoff) {
> struct page *page = sg_page_iter_page();
>
> -   ret = remap_pfn_range(vma, addr, page_to_pfn(page), PAGE_SIZE,
> - vma->vm_page_prot);
> +   ret = vm_insert_page(vma, addr, page);
> if (ret)
> return ret;
> addr += PAGE_SIZE;
> --
> 2.30.0.280.ga3ce27912f-goog
>


[PATCH 1/1] dma-buf: heaps: Map system heap pages as managed by linux vm

2021-01-28 Thread Suren Baghdasaryan
Currently system heap maps its buffers with VM_PFNMAP flag using
remap_pfn_range. This results in such buffers not being accounted
for in PSS calculations because vm treats this memory as having no
page structs. Without page structs there are no counters representing
how many processes are mapping a page and therefore PSS calculation
is impossible.
Historically, ION driver used to map its buffers as VM_PFNMAP areas
due to memory carveouts that did not have page structs [1]. That
is not the case anymore and it seems there was desire to move away
from remap_pfn_range [2].
Dmabuf system heap design inherits this ION behavior and maps its
pages using remap_pfn_range even though allocated pages are backed
by page structs.
Clear VM_IO and VM_PFNMAP flags when mapping memory allocated by the
system heap and replace remap_pfn_range with vm_insert_page, following
Laura's suggestion in [1]. This would allow correct PSS calculation
for dmabufs.

[1] 
https://driverdev-devel.linuxdriverproject.narkive.com/v0fJGpaD/using-ion-memory-for-direct-io
[2] 
http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2018-October/127519.html
(sorry, could not find lore links for these discussions)

Suggested-by: Laura Abbott 
Signed-off-by: Suren Baghdasaryan 
---
 drivers/dma-buf/heaps/system_heap.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/dma-buf/heaps/system_heap.c 
b/drivers/dma-buf/heaps/system_heap.c
index 17e0e9a68baf..0e92e42b2251 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -200,11 +200,13 @@ static int system_heap_mmap(struct dma_buf *dmabuf, 
struct vm_area_struct *vma)
struct sg_page_iter piter;
int ret;
 
+   /* All pages are backed by a "struct page" */
+   vma->vm_flags &= ~VM_PFNMAP;
+
for_each_sgtable_page(table, , vma->vm_pgoff) {
struct page *page = sg_page_iter_page();
 
-   ret = remap_pfn_range(vma, addr, page_to_pfn(page), PAGE_SIZE,
- vma->vm_page_prot);
+   ret = vm_insert_page(vma, addr, page);
if (ret)
return ret;
addr += PAGE_SIZE;
-- 
2.30.0.280.ga3ce27912f-goog
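
Background on why this enables PSS accounting: the smaps walk divides
each present page's size by the number of processes mapping it, which
requires a struct page with a map count. A rough sketch follows
(simplified; the real logic in fs/proc/task_mmu.c differs in detail, and
the iterator here is hypothetical):

	for_each_present_page_in_vma(vma, page) {
		int mapcount = page_mapcount(page);	/* needs a struct page */

		pss += PAGE_SIZE / (mapcount ? mapcount : 1);
	}
	/* VM_PFNMAP mappings carry no struct pages, so their pages never
	 * reach this loop and contribute nothing to PSS. */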



Re: [PATCH 1/1] process_madvise.2: Add process_madvise man page

2021-01-26 Thread Suren Baghdasaryan
On Mon, Jan 25, 2021 at 5:19 AM 'Michal Hocko' via kernel-team
 wrote:
>
> On Wed 20-01-21 12:23:37, Suren Baghdasaryan wrote:
> [...]
> > MADV_COLD (since Linux 5.4.1)
> > Deactivate a given range of pages by moving them from active to
> > inactive LRU list. This is done to accelerate the reclaim of these
> > pages. The advice might be ignored for some pages in the range when 
> > it
> > is not applicable.
>
> I do not think we want to talk about active/inactive LRU lists here.
> Wouldn't it be sufficient to say
> Deactivate a given range of pages which will make them a more probable
> reclaim target should there be memory pressure. This is a
> non-destructive operation.

That sounds better. Will update in the next version.

>
> Other than that, looks good to me from the content POV.
>
> Thanks!

Thanks for the review Michal!

> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kernel-team+unsubscr...@android.com.
>


Re: [RESEND][PATCH 2/3] dma-buf: heaps: Add a WARN_ON should the vmap_cnt go negative

2021-01-22 Thread Suren Baghdasaryan
On Thu, Jan 21, 2021 at 11:56 PM Sumit Semwal  wrote:
>
> Hi John, Suren,
>
>
> On Wed, 20 Jan 2021 at 02:15, John Stultz  wrote:
> >
> > We shouldn't vunmap more than we vmap, but if we do, make
> > sure we complain loudly.
>
> I was checking the general usage of vunmap in the kernel, and I
> couldn't find many instances where we WARN_ON when the vunmap count
> exceeds the vmap count. Is there a specific need for this in the heaps?

Hi Sumit,
My worry was that buffer->vmap_cnt could silently go negative. But if
this warning is not consistent with other places we do refcounted
vmap/vunmap then feel free to ignore my suggestion.
Thanks!

>
> Best,
> Sumit.
> >
> > Cc: Sumit Semwal 
> > Cc: Liam Mark 
> > Cc: Laura Abbott 
> > Cc: Brian Starkey 
> > Cc: Hridya Valsaraju 
> > Cc: Suren Baghdasaryan 
> > Cc: Sandeep Patil 
> > Cc: Daniel Mentz 
> > Cc: Chris Goldsworthy 
> > Cc: Ørjan Eide 
> > Cc: Robin Murphy 
> > Cc: Ezequiel Garcia 
> > Cc: Simon Ser 
> > Cc: James Jones 
> > Cc: linux-me...@vger.kernel.org
> > Cc: dri-de...@lists.freedesktop.org
> > Suggested-by: Suren Baghdasaryan 
> > Signed-off-by: John Stultz 
> > ---
> >  drivers/dma-buf/heaps/cma_heap.c| 1 +
> >  drivers/dma-buf/heaps/system_heap.c | 1 +
> >  2 files changed, 2 insertions(+)
> >
> > diff --git a/drivers/dma-buf/heaps/cma_heap.c 
> > b/drivers/dma-buf/heaps/cma_heap.c
> > index 364fc2f3e499..0c76cbc3fb11 100644
> > --- a/drivers/dma-buf/heaps/cma_heap.c
> > +++ b/drivers/dma-buf/heaps/cma_heap.c
> > @@ -232,6 +232,7 @@ static void cma_heap_vunmap(struct dma_buf *dmabuf, 
> > struct dma_buf_map *map)
> > struct cma_heap_buffer *buffer = dmabuf->priv;
> >
> > mutex_lock(>lock);
> > +   WARN_ON(buffer->vmap_cnt == 0);
> > if (!--buffer->vmap_cnt) {
> > vunmap(buffer->vaddr);
> > buffer->vaddr = NULL;
> > diff --git a/drivers/dma-buf/heaps/system_heap.c 
> > b/drivers/dma-buf/heaps/system_heap.c
> > index 405351aad2a8..2321c91891f6 100644
> > --- a/drivers/dma-buf/heaps/system_heap.c
> > +++ b/drivers/dma-buf/heaps/system_heap.c
> > @@ -273,6 +273,7 @@ static void system_heap_vunmap(struct dma_buf *dmabuf, 
> > struct dma_buf_map *map)
> > struct system_heap_buffer *buffer = dmabuf->priv;
> >
> > mutex_lock(>lock);
> > +   WARN_ON(buffer->vmap_cnt == 0);
> > if (!--buffer->vmap_cnt) {
> > vunmap(buffer->vaddr);
> > buffer->vaddr = NULL;
> > --
> > 2.17.1
> >
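
The refcounting pattern these WARN_ONs guard, reduced to a sketch (the
buffer type and locking are elided; field names follow the patch):

	static void *heap_do_vmap(struct heap_buffer *buffer)
	{
		/* First user creates the mapping; later users share it. */
		if (!buffer->vmap_cnt++)
			buffer->vaddr = vmap(buffer->pages, buffer->pagecount,
					     VM_MAP, PAGE_KERNEL);
		return buffer->vaddr;
	}

	static void heap_do_vunmap(struct heap_buffer *buffer)
	{
		WARN_ON(buffer->vmap_cnt == 0);	/* unbalanced vunmap */
		if (!--buffer->vmap_cnt) {
			vunmap(buffer->vaddr);
			buffer->vaddr = NULL;
		}
	}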


Re: [PATCH v3 1/4] mm: cma: introduce gfp flag in cma_alloc instead of no_warn

2021-01-20 Thread Suren Baghdasaryan
On Tue, Jan 12, 2021 at 5:21 PM Minchan Kim  wrote:
>
> The upcoming patch will introduce __GFP_NORETRY semantics
> in alloc_contig_range, which is a failfast mode of the API.
> Instead of adding an additional parameter for gfp, replace
> no_warn with a gfp flag.
>
> To keep the old behavior, it follows the rules below.
>
>   no_warn                 gfp_flags
>
>   false                   GFP_KERNEL
>   true                    GFP_KERNEL|__GFP_NOWARN
>   gfp & __GFP_NOWARN      GFP_KERNEL | (gfp & __GFP_NOWARN)
>
> Signed-off-by: Minchan Kim 

Reviewed-by: Suren Baghdasaryan 

> ---
>  drivers/dma-buf/heaps/cma_heap.c |  2 +-
>  drivers/s390/char/vmcp.c |  2 +-
>  include/linux/cma.h  |  2 +-
>  kernel/dma/contiguous.c  |  3 ++-
>  mm/cma.c | 12 ++--
>  mm/cma_debug.c   |  2 +-
>  mm/hugetlb.c |  6 --
>  mm/secretmem.c   |  3 ++-
>  8 files changed, 18 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/dma-buf/heaps/cma_heap.c 
> b/drivers/dma-buf/heaps/cma_heap.c
> index 364fc2f3e499..0afc1907887a 100644
> --- a/drivers/dma-buf/heaps/cma_heap.c
> +++ b/drivers/dma-buf/heaps/cma_heap.c
> @@ -298,7 +298,7 @@ static int cma_heap_allocate(struct dma_heap *heap,
> if (align > CONFIG_CMA_ALIGNMENT)
> align = CONFIG_CMA_ALIGNMENT;
>
> -   cma_pages = cma_alloc(cma_heap->cma, pagecount, align, false);
> +   cma_pages = cma_alloc(cma_heap->cma, pagecount, align, GFP_KERNEL);
> if (!cma_pages)
> goto free_buffer;
>
> diff --git a/drivers/s390/char/vmcp.c b/drivers/s390/char/vmcp.c
> index 9e066281e2d0..78f9adf56456 100644
> --- a/drivers/s390/char/vmcp.c
> +++ b/drivers/s390/char/vmcp.c
> @@ -70,7 +70,7 @@ static void vmcp_response_alloc(struct vmcp_session 
> *session)
>  * anymore the system won't work anyway.
>  */
> if (order > 2)
> -   page = cma_alloc(vmcp_cma, nr_pages, 0, false);
> +   page = cma_alloc(vmcp_cma, nr_pages, 0, GFP_KERNEL);
> if (page) {
> session->response = (char *)page_to_phys(page);
> session->cma_alloc = 1;
> diff --git a/include/linux/cma.h b/include/linux/cma.h
> index 217999c8a762..d6c02d08ddbc 100644
> --- a/include/linux/cma.h
> +++ b/include/linux/cma.h
> @@ -45,7 +45,7 @@ extern int cma_init_reserved_mem(phys_addr_t base, 
> phys_addr_t size,
> const char *name,
> struct cma **res_cma);
>  extern struct page *cma_alloc(struct cma *cma, size_t count, unsigned int 
> align,
> - bool no_warn);
> + gfp_t gfp_mask);
>  extern bool cma_release(struct cma *cma, const struct page *pages, unsigned 
> int count);
>
>  extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void 
> *data);
> diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
> index 3d63d91cba5c..552ed531c018 100644
> --- a/kernel/dma/contiguous.c
> +++ b/kernel/dma/contiguous.c
> @@ -260,7 +260,8 @@ struct page *dma_alloc_from_contiguous(struct device 
> *dev, size_t count,
> if (align > CONFIG_CMA_ALIGNMENT)
> align = CONFIG_CMA_ALIGNMENT;
>
> -   return cma_alloc(dev_get_cma_area(dev), count, align, no_warn);
> +   return cma_alloc(dev_get_cma_area(dev), count, align, GFP_KERNEL |
> +   (no_warn ? __GFP_NOWARN : 0));
>  }
>
>  /**
> diff --git a/mm/cma.c b/mm/cma.c
> index 0ba69cd16aeb..35053b82aedc 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -419,13 +419,13 @@ static inline void cma_debug_show_areas(struct cma 
> *cma) { }
>   * @cma:   Contiguous memory region for which the allocation is performed.
>   * @count: Requested number of pages.
>   * @align: Requested alignment of pages (in PAGE_SIZE order).
> - * @no_warn: Avoid printing message about failed allocation
> + * @gfp_mask: GFP mask to use during the cma allocation.
>   *
>   * This function allocates part of contiguous memory on specific
>   * contiguous memory area.
>   */
>  struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
> -  bool no_warn)
> +  gfp_t gfp_mask)
>  {
> unsigned long mask, offset;
> unsigned long pfn = -1;
> @@ -438,8 +438,8 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
> unsigned int align,
> if (!cma || !cma->count || !cma->bitmap)
> return NULL;
>
> -   pr
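
Condensed, the caller-side translation implied by the table above
(sketch; the two signatures cannot coexist and are shown for comparison
only):

	/*
	 *   old: cma_alloc(cma, count, align, true);    - no_warn = true
	 *   new: cma_alloc(cma, count, align, GFP_KERNEL | __GFP_NOWARN);
	 */
	page = cma_alloc(cma, count, align, GFP_KERNEL | __GFP_NOWARN);
	if (!page)
		return -ENOMEM;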

Re: [PATCH v3 4/4] dma-buf: heaps: add chunk heap to dmabuf heaps

2021-01-20 Thread Suren Baghdasaryan
On Tue, Jan 19, 2021 at 7:39 PM Hyesoo Yu  wrote:
>
> On Tue, Jan 19, 2021 at 12:36:40PM -0800, Minchan Kim wrote:
> > On Tue, Jan 19, 2021 at 10:29:29AM -0800, John Stultz wrote:
> > > On Tue, Jan 12, 2021 at 5:22 PM Minchan Kim  wrote:
> > > >
> > > > From: Hyesoo Yu 
> > > >
> > > > This patch supports a chunk heap that allocates buffers
> > > > arranged into a list of fixed-size chunks taken from CMA.
> > > >
> > > > The chunk heap driver is bound directly to a reserved_memory
> > > > node by following Rob Herring's suggestion in [1].
> > > >
> > > > [1] 
> > > > https://lore.kernel.org/lkml/20191025225009.50305-2-john.stu...@linaro.org/T/#m3dc63acd33fea269a584f43bb799a876f0b2b45d
> > > >
> > > > Signed-off-by: Hyesoo Yu 
> > > > Signed-off-by: Hridya Valsaraju 
> > > > Signed-off-by: Minchan Kim 

After addressing John's comments feel free to add Reviewed-by: Suren
Baghdasaryan 

> > > > ---
> > > ...
> > > > +static int register_chunk_heap(struct chunk_heap *chunk_heap_info)
> > > > +{
> > > > +   struct dma_heap_export_info exp_info;
> > > > +
> > > > +   exp_info.name = cma_get_name(chunk_heap_info->cma);
> > >
> > > One potential issue here, you're setting the name to the same as the
> > > CMA name. Since the CMA heap uses the CMA name, if one chunk was
> > > registered as a chunk heap but also was the default CMA area, it might
> > > be registered twice. But since both would have the same name it would
> > > be an initialization race as to which one "wins".
> >
> > Good point. Maybe someone might want to use the default CMA area for
> > both cma_heap and chunk_heap. I cannot come up with a reason why we
> > should prohibit it atm.
> >
> > >
> > > So maybe could you postfix the CMA name with "-chunk" or something?
> >
> > Hyesoo, Any opinion?
> > Unless you have some other idea, let's fix it in the next version.
> >
>
> I agree with that. It is not good to use the CMA name directly as the heap name.
> Let's postfix the name with '-chunk'
>
> Thanks,
> Regards.
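
A sketch of the suggested fix (hypothetical, not the actual respin):
derive the heap name from the CMA name with a "-chunk" postfix so it
cannot collide with a CMA heap registered against the same area:

static int register_chunk_heap(struct chunk_heap *chunk_heap_info)
{
	struct dma_heap_export_info exp_info;

	/* Postfix the CMA name so the two heaps register distinct names. */
	exp_info.name = kasprintf(GFP_KERNEL, "%s-chunk",
				  cma_get_name(chunk_heap_info->cma));
	if (!exp_info.name)
		return -ENOMEM;
	/* ... register the heap as before ... */
	return 0;
}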


Re: [PATCH v2 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-01-20 Thread Suren Baghdasaryan
On Wed, Jan 20, 2021 at 8:57 AM Suren Baghdasaryan  wrote:
>
> On Wed, Jan 20, 2021 at 5:18 AM Jann Horn  wrote:
> >
> > On Wed, Jan 13, 2021 at 3:22 PM Michal Hocko  wrote:
> > > On Tue 12-01-21 09:51:24, Suren Baghdasaryan wrote:
> > > > On Tue, Jan 12, 2021 at 9:45 AM Oleg Nesterov  wrote:
> > > > >
> > > > > On 01/12, Michal Hocko wrote:
> > > > > >
> > > > > > On Mon 11-01-21 09:06:22, Suren Baghdasaryan wrote:
> > > > > >
> > > > > > > What we want is the ability for one process to influence another 
> > > > > > > process
> > > > > > > in order to optimize performance across the entire system while 
> > > > > > > leaving
> > > > > > > the security boundary intact.
> > > > > > > Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ
> > > > > > > and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR 
> > > > > > > metadata
> > > > > > > and CAP_SYS_NICE for influencing process performance.
> > > > > >
> > > > > > I have to say that ptrace modes are rather obscure to me. So I 
> > > > > > cannot
> > > > > > really judge whether MODE_READ is sufficient. My understanding has
> > > > > > always been that this is required for RO access to the address space. 
> > > > > > But
> > > > > > this operation clearly has a visible side effect. Do we have any 
> > > > > > actual
> > > > > > documentation for the existing modes?
> > > > > >
> > > > > > I would be really curious to hear from Jann and Oleg (now Cced).
> > > > >
> > > > > Can't comment, sorry. I never understood these security checks and 
> > > > > never tried.
> > > > > IIUC only selinux/etc can treat ATTACH/READ differently and I have no 
> > > > > idea what
> > > > > is the difference.
> >
> > Yama in particular only does its checks on ATTACH and ignores READ,
> > that's the difference you're probably most likely to encounter on a
> > normal desktop system, since some distros turn Yama on by default.
> > Basically the idea there is that running "gdb -p $pid" or "strace -p
> > $pid" as a normal user will usually fail, but reading /proc/$pid/maps
> > still works; so you can see things like detailed memory usage
> > information and such, but you're not supposed to be able to directly
> > peek into a running SSH client and inject data into the existing SSH
> > connection, or steal the cryptographic keys for the current
> > connection, or something like that.
> >
> > > > I haven't seen a written explanation on ptrace modes but when I
> > > > consulted Jann his explanation was:
> > > >
> > > > PTRACE_MODE_READ means you can inspect metadata about processes with
> > > > the specified domain, across UID boundaries.
> > > > PTRACE_MODE_ATTACH means you can fully impersonate processes with the
> > > > specified domain, across UID boundaries.
> > >
> > > Maybe this would be a good start to document expectations. Some more
> > > practical examples where the difference is visible would be great as
> > > well.
> >
> > Before documenting the behavior, it would be a good idea to figure out
> > what to do with perf_event_open(). That one's weird in that it only
> > requires PTRACE_MODE_READ, but actually allows you to sample stuff
> > like userspace stack and register contents (if perf_event_paranoid is
> > 1 or 2). Maybe for SELinux things (and maybe also for Yama), there
> > should be a level in between that allows fully inspecting the process
> > (for purposes like profiling) but without the ability to corrupt its
> > memory or registers or things like that. Or maybe perf_event_open()
> > should just use the ATTACH mode.
>
> Thanks for additional clarifications, Jann!
> Just to clarify, the documentation I'm preparing is a man page for
> process_madvise(2) which will list the required capabilities but won't
> dive into all the security details.
> I believe the above suggestions are for documenting different PTRACE
> modes and will not be included in that man page. Maybe a separate
> document could do that but I'm definitely not qualified to write it.

Folks, I posted the man page here:
https://lore.kernel.org/linux-mm/20210120202337.1481402-1-sur...@google.com/

Also, I realized that this patch has not changed at all, and if I send a
new version, the only difference would be CC'ing it to stable and
linux-security-module.
I'm CC'ing stable (James already CC'ed LSM), but if I should re-post
it please let me know.

Cc: sta...@vger.kernel.org # 5.10+


[PATCH 1/1] process_madvise.2: Add process_madvise man page

2021-01-20 Thread Suren Baghdasaryan
Initial version of the process_madvise(2) manual page. The initial text
was extracted from [1], amended after fix [2], and more details were added
using the man pages of madvise(2) and process_vm_readv(2) as examples. It
also includes the changes to the required permissions proposed in [3].

[1] https://lore.kernel.org/patchwork/patch/1297933/
[2] https://lkml.org/lkml/2020/12/8/1282
[3] https://patchwork.kernel.org/project/selinux/patch/2021070622.2613577-1-sur...@google.com/#23888311

Signed-off-by: Suren Baghdasaryan 
Signed-off-by: Minchan Kim 
---

Adding the plain text version for ease of review:

NAME
process_madvise - give advice about use of memory to a process

SYNOPSIS
#include <sys/mman.h>

ssize_t process_madvise(int pidfd,
   const struct iovec *iovec,
   unsigned long vlen,
   int advice,
   unsigned int flags);

DESCRIPTION
The process_madvise() system call is used to give advice or directions to
the kernel about the address ranges of an external process or of the
calling process itself. It provides the advice for the address ranges of
the process described by iovec and vlen. The goal of such advice is to
improve system or application performance.

The pidfd argument selects the process referred to by the PID file
descriptor specified in pidfd (see pidfd_open(2) for further information).

The pointer iovec points to an array of iovec structures, defined in
<sys/uio.h> as:

struct iovec {
void  *iov_base;/* Starting address */
size_t iov_len; /* Number of bytes to transfer */
};

Each iovec element describes an address range beginning at the iov_base
address and spanning iov_len bytes.

The vlen represents the number of elements in iovec.

The advice can be one of the values listed below.

  Linux-specific advice values
The following Linux-specific advice values have no counterparts in the
POSIX-specified posix_madvise(3), and may or may not have counterparts in
the madvise() interface available on other implementations.

MADV_COLD (since Linux 5.4.1)
Deactivate a given range of pages by moving them from active to
inactive LRU list. This is done to accelerate the reclaim of these
pages. The advice might be ignored for some pages in the range when it
is not applicable.
MADV_PAGEOUT (since Linux 5.4.1)
Reclaim a given range of pages. This is done to free up memory occupied
by these pages. If a page is anonymous it will be swapped out. If a
page is file-backed and dirty it will be written back into the backing
storage. The advice might be ignored for some pages in the range when
it is not applicable.

The flags argument is reserved for future use; currently, this argument must
be specified as 0.

The value specified in the vlen argument must be less than or equal to
IOV_MAX (defined in <limits.h> or accessible via the call
sysconf(_SC_IOV_MAX)).

The vlen and iovec arguments are checked before applying any hints. If
vlen is too big or iovec is invalid, an error is returned immediately.

The hint might be applied to only a part of iovec if one of its elements
points to an invalid memory region in the remote process. No further
elements will be processed beyond that point.

Permission to provide a hint to an external process is governed by a
ptrace access mode PTRACE_MODE_READ_FSCREDS check (see ptrace(2)). In
addition, the caller must have the CAP_SYS_NICE capability in order to
affect the performance of an external process.

RETURN VALUE
On success, process_madvise() returns the number of bytes advised. This
return value may be less than the total number of requested bytes if an
error occurred after some of them were processed. The caller should check
the return value to determine whether partial advice occurred.
ERRORS
EFAULT The memory described by iovec is outside the accessible address
   space of the process referred to by pidfd.
EINVAL flags is not 0.
EINVAL The sum of the iov_len values of iovec overflows a ssize_t value.
EINVAL vlen is too large.
ENOMEM Could not allocate memory for internal copies of the iovec
   structures.
EPERM The caller does not have permission to access the address space of
  the process referred to by pidfd.
ESRCH The process referred to by pidfd does not exist.

VERSIONS
Since Linux 5.10, support for this system call is optional, depending on
the setting of the CONFIG_ADVISE_SYSCALLS configuration option.

SEE ALSO
madvise(2), pidfd_open(2), process_vm_readv(2), process_vm_writev(2)
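
An illustrative caller, not part of the posted page. It assumes recent
kernel headers (for SYS_process_madvise and MADV_COLD), no libc wrapper,
and made-up values for the target PID and address range:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>      /* MADV_COLD (recent kernel headers) */
#include <sys/syscall.h>   /* SYS_pidfd_open, SYS_process_madvise */
#include <sys/uio.h>       /* struct iovec */
#include <unistd.h>

int main(void)
{
	int pidfd = syscall(SYS_pidfd_open, 1234, 0); /* hypothetical PID */
	struct iovec iov = {
		.iov_base = (void *)0x7f0000000000,   /* hypothetical range */
		.iov_len  = 4096,
	};
	ssize_t ret;

	if (pidfd < 0) {
		perror("pidfd_open");
		return 1;
	}
	/* Advise one remote range as cold; flags must currently be 0. */
	ret = syscall(SYS_process_madvise, pidfd, &iov, 1, MADV_COLD, 0);
	if (ret < 0)
		perror("process_madvise");
	else
		printf("advised %zd bytes\n", ret);
	return 0;
}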

 man2/process_madvise.2 | 208 +
 1 file changed, 208 insertions(+)
 create mode 100644 man2/process_madvise.2

diff --git a/man2/process_madvise.2 b/man2/process_madvise.2
new file mode 100644
index 0..9bb5cb5ed
--- /dev/null
+++ b/man2/process_madvise.2
@@ -0,0 +1,208 @@
+.\" Copyright (C)

Re: [PATCH v2 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-01-20 Thread Suren Baghdasaryan
On Wed, Jan 20, 2021 at 5:18 AM Jann Horn  wrote:
>
> On Wed, Jan 13, 2021 at 3:22 PM Michal Hocko  wrote:
> > On Tue 12-01-21 09:51:24, Suren Baghdasaryan wrote:
> > > On Tue, Jan 12, 2021 at 9:45 AM Oleg Nesterov  wrote:
> > > >
> > > > On 01/12, Michal Hocko wrote:
> > > > >
> > > > > On Mon 11-01-21 09:06:22, Suren Baghdasaryan wrote:
> > > > >
> > > > > > What we want is the ability for one process to influence another 
> > > > > > process
> > > > > > in order to optimize performance across the entire system while 
> > > > > > leaving
> > > > > > the security boundary intact.
> > > > > > Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ
> > > > > > and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata
> > > > > > and CAP_SYS_NICE for influencing process performance.
> > > > >
> > > > > I have to say that ptrace modes are rather obscure to me. So I cannot
> > > > > really judge whether MODE_READ is sufficient. My understanding has
> > > > > always been that this is required for RO access to the address space. 
> > > > > But
> > > > > this operation clearly has a visible side effect. Do we have any 
> > > > > actual
> > > > > documentation for the existing modes?
> > > > >
> > > > > I would be really curious to hear from Jann and Oleg (now Cced).
> > > >
> > > > Can't comment, sorry. I never understood these security checks and 
> > > > never tried.
> > > > IIUC only selinux/etc can treat ATTACH/READ differently and I have no 
> > > > idea what
> > > > is the difference.
>
> Yama in particular only does its checks on ATTACH and ignores READ,
> that's the difference you're probably most likely to encounter on a
> normal desktop system, since some distros turn Yama on by default.
> Basically the idea there is that running "gdb -p $pid" or "strace -p
> $pid" as a normal user will usually fail, but reading /proc/$pid/maps
> still works; so you can see things like detailed memory usage
> information and such, but you're not supposed to be able to directly
> peek into a running SSH client and inject data into the existing SSH
> connection, or steal the cryptographic keys for the current
> connection, or something like that.
>
> > > I haven't seen a written explanation on ptrace modes but when I
> > > consulted Jann his explanation was:
> > >
> > > PTRACE_MODE_READ means you can inspect metadata about processes with
> > > the specified domain, across UID boundaries.
> > > PTRACE_MODE_ATTACH means you can fully impersonate processes with the
> > > specified domain, across UID boundaries.
> >
> > Maybe this would be a good start to document expectations. Some more
> > practical examples where the difference is visible would be great as
> > well.
>
> Before documenting the behavior, it would be a good idea to figure out
> what to do with perf_event_open(). That one's weird in that it only
> requires PTRACE_MODE_READ, but actually allows you to sample stuff
> like userspace stack and register contents (if perf_event_paranoid is
> 1 or 2). Maybe for SELinux things (and maybe also for Yama), there
> should be a level in between that allows fully inspecting the process
> (for purposes like profiling) but without the ability to corrupt its
> memory or registers or things like that. Or maybe perf_event_open()
> should just use the ATTACH mode.

Thanks for additional clarifications, Jann!
Just to clarify, the documentation I'm preparing is a man page for
process_madvise(2) which will list the required capabilities but won't
dive into all the security details.
I believe the above suggestions are for documenting different PTRACE
modes and will not be included in that man page. Maybe a separate
document could do that but I'm definitely not qualified to write it.


Re: [PATCH v2 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-01-20 Thread Suren Baghdasaryan
On Tue, Jan 19, 2021 at 9:02 PM James Morris  wrote:
>
> On Mon, 11 Jan 2021, Suren Baghdasaryan wrote:
>
> > Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ
> > and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata
> > and CAP_SYS_NICE for influencing process performance.
>
>
> Almost missed these -- please cc the LSM mailing list when modifying
> capabilities or other LSM-related things.

Thanks for the note. Will definitely include it when sending the next version.

>
> --
> James Morris
> 
>


Re: [PATCH v2 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-01-13 Thread Suren Baghdasaryan
On Wed, Jan 13, 2021 at 6:22 AM Michal Hocko  wrote:
>
> On Tue 12-01-21 09:51:24, Suren Baghdasaryan wrote:
> > On Tue, Jan 12, 2021 at 9:45 AM Oleg Nesterov  wrote:
> > >
> > > On 01/12, Michal Hocko wrote:
> > > >
> > > > On Mon 11-01-21 09:06:22, Suren Baghdasaryan wrote:
> > > >
> > > > > What we want is the ability for one process to influence another 
> > > > > process
> > > > > in order to optimize performance across the entire system while 
> > > > > leaving
> > > > > the security boundary intact.
> > > > > Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ
> > > > > and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata
> > > > > and CAP_SYS_NICE for influencing process performance.
> > > >
> > > > I have to say that ptrace modes are rather obscure to me. So I cannot
> > > > really judge whether MODE_READ is sufficient. My understanding has
> > > > always been that this is required for RO access to the address space. But
> > > > this operation clearly has a visible side effect. Do we have any actual
> > > > documentation for the existing modes?
> > > >
> > > > I would be really curious to hear from Jann and Oleg (now Cced).
> > >
> > > Can't comment, sorry. I never understood these security checks and never 
> > > tried.
> > > IIUC only selinux/etc can treat ATTACH/READ differently and I have no 
> > > idea what
> > > is the difference.
> >
> > I haven't seen a written explanation on ptrace modes but when I
> > consulted Jann his explanation was:
> >
> > PTRACE_MODE_READ means you can inspect metadata about processes with
> > the specified domain, across UID boundaries.
> > PTRACE_MODE_ATTACH means you can fully impersonate processes with the
> > specified domain, across UID boundaries.
>
> Maybe this would be a good start to document expectations. Some more
> practical examples where the difference is visible would be great as
> well.

I'll do my best but I'm also not a security expert. Will post the next
version with a draft for the man page (this syscall does not have a
man page yet AFAICT) and we can iterate on the wording there.

> > He did agree that in this case PTRACE_MODE_ATTACH seems too
> > restrictive (we do not try to gain full control or impersonate a
> > process) and PTRACE_MODE_READ is a better choice.
>
> All that being said, I am not against the changed behavior but I do not
> feel competent to give an ack.

Great. Sounds like the only missing piece is the man page with more
details. I'll work on it but since it's the first time I will be
contributing to man pages it might take me a couple days. Thanks
everyone for the reviews!

> --
> Michal Hocko
> SUSE Labs


Re: [PATCH v2 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-01-12 Thread Suren Baghdasaryan
On Mon, Jan 11, 2021 at 11:46 PM Michal Hocko  wrote:
>
> On Mon 11-01-21 09:06:22, Suren Baghdasaryan wrote:
> > process_madvise currently requires ptrace attach capability.
> > PTRACE_MODE_ATTACH gives one process complete control over another
> > process. It effectively removes the security boundary between the
> > two processes (in one direction). Granting ptrace attach capability
> > even to a system process is considered dangerous since it creates an
> > attack surface. This severely limits the usage of this API.
> > The operations process_madvise can perform do not affect the correctness
> > of the operation of the target process; they only affect where the data
> > is physically located (and therefore, how fast it can be accessed).
>
> Yes it doesn't influence the correctness but it is still a very
> sensitive operation because it can allow a targeted side channel timing
> attacks so we should be really careful.

Sorry, I missed this comment in my answer. The possibility of affecting
the target's performance, including side-channel attacks, is why we
require CAP_SYS_NICE.

>
> > What we want is the ability for one process to influence another process
> > in order to optimize performance across the entire system while leaving
> > the security boundary intact.
> > Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ
> > and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata
> > and CAP_SYS_NICE for influencing process performance.
>
> I have to say that ptrace modes are rather obscure to me. So I cannot
> really judge whether MODE_READ is sufficient. My understanding has
> always been that this is required for RO access to the address space. But
> this operation clearly has a visible side effect. Do we have any actual
> documentation for the existing modes?
>
> I would be really curious to hear from Jann and Oleg (now Cced).
>
> Is CAP_SYS_NICE requirement really necessary?
>
> > Signed-off-by: Suren Baghdasaryan 
> > Acked-by: Minchan Kim 
> > Acked-by: David Rientjes 
> > ---
> >  mm/madvise.c | 13 -
> >  1 file changed, 12 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 6a660858784b..a9bcd16b5d95 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1197,12 +1197,22 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const 
> > struct iovec __user *, vec,
> >   goto release_task;
> >   }
> >
> > - mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> > + /* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. */
> > + mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
> >   if (IS_ERR_OR_NULL(mm)) {
> >   ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> >   goto release_task;
> >   }
> >
> > + /*
> > +  * Require CAP_SYS_NICE for influencing process performance. Note that
> > +  * only non-destructive hints are currently supported.
> > +  */
> > + if (!capable(CAP_SYS_NICE)) {
> > + ret = -EPERM;
> > + goto release_mm;
> > + }
> > +
> >   total_len = iov_iter_count();
> >
> >   while (iov_iter_count()) {
> > @@ -1217,6 +1227,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const 
> > struct iovec __user *, vec,
> >   if (ret == 0)
> >   ret = total_len - iov_iter_count();
> >
> > +release_mm:
> >   mmput(mm);
> >  release_task:
> >   put_task_struct(task);
> > --
> > 2.30.0.284.gd98b1dd5eaa7-goog
> >
>
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH v2 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-01-12 Thread Suren Baghdasaryan
On Tue, Jan 12, 2021 at 9:45 AM Oleg Nesterov  wrote:
>
> On 01/12, Michal Hocko wrote:
> >
> > On Mon 11-01-21 09:06:22, Suren Baghdasaryan wrote:
> >
> > > What we want is the ability for one process to influence another process
> > > in order to optimize performance across the entire system while leaving
> > > the security boundary intact.
> > > Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ
> > > and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata
> > > and CAP_SYS_NICE for influencing process performance.
> >
> > I have to say that ptrace modes are rather obscure to me. So I cannot
> > really judge whether MODE_READ is sufficient. My understanding has
> > always been that this is required for RO access to the address space. But
> > this operation clearly has a visible side effect. Do we have any actual
> > documentation for the existing modes?
> >
> > I would be really curious to hear from Jann and Oleg (now Cced).
>
> Can't comment, sorry. I never understood these security checks and never 
> tried.
> IIUC only selinux/etc can treat ATTACH/READ differently and I have no idea 
> what
> is the difference.

I haven't seen a written explanation on ptrace modes but when I
consulted Jann his explanation was:

PTRACE_MODE_READ means you can inspect metadata about processes with
the specified domain, across UID boundaries.
PTRACE_MODE_ATTACH means you can fully impersonate processes with the
specified domain, across UID boundaries.

He did agree that in this case PTRACE_MODE_ATTACH seems too
restrictive (we do not try to gain full control or impersonate a
process) and PTRACE_MODE_READ is a better choice.

>
> Oleg.
>


Re: [PATCH v2 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-01-12 Thread Suren Baghdasaryan
On Mon, Jan 11, 2021 at 5:22 PM Andrew Morton  wrote:
>
> On Mon, 11 Jan 2021 09:06:22 -0800 Suren Baghdasaryan  
> wrote:
>
> > process_madvise currently requires ptrace attach capability.
> > PTRACE_MODE_ATTACH gives one process complete control over another
> > process. It effectively removes the security boundary between the
> > two processes (in one direction). Granting ptrace attach capability
> > even to a system process is considered dangerous since it creates an
> > attack surface. This severely limits the usage of this API.
> > The operations process_madvise can perform do not affect the correctness
> > of the operation of the target process; they only affect where the data
> > is physically located (and therefore, how fast it can be accessed).
> > What we want is the ability for one process to influence another process
> > in order to optimize performance across the entire system while leaving
> > the security boundary intact.
> > Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ
> > and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata
> > and CAP_SYS_NICE for influencing process performance.
>
> It would be useful to see the proposed manpage update.
>
> process_madvise() was released in 5.10, so this is a
> non-backward-compatible change to a released kernel.
>
> I think it would be OK at this stage to feed this into 5.10.x with a
> cc:stable and suitable words in the changelog explaining why we're
> doing this.

Sure, I will post another patchset that will include manpage update
and will CC:stable. That's of course after Michal's concerns are
addressed.
Thanks!

>
> Alternatively we could retain PTRACE_MODE_ATTACH's behaviour and add
> PTRACE_MODE_READ_SYS_NICE alongside that.


Re: [PATCH 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-01-11 Thread Suren Baghdasaryan
On Mon, Jan 11, 2021 at 9:05 AM Suren Baghdasaryan  wrote:
>
> On Mon, Jan 11, 2021 at 2:20 AM Florian Weimer  wrote:
> >
> > * Suren Baghdasaryan:
> >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 6a660858784b..c2d600386902 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -1197,12 +1197,22 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, 
> > > const struct iovec __user *, vec,
> > >   goto release_task;
> > >   }
> > >
> > > - mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> > > + /* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. */
> > > + mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
> > >   if (IS_ERR_OR_NULL(mm)) {
> > >   ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> > >   goto release_task;
> > >   }
> >
> > Shouldn't this depend on the requested behavior?  Several operations
> > directly result in observable changes, and go beyond performance tuning.
>
> Thanks for the comment Florian.
> process_madvise supports only MADV_COLD and MADV_PAGEOUT hints which
> are both non-destructive (see process_madvise_behavior_valid()
> function). Maybe you meant something else by "observable changes", if
> so please clarify.
> Thanks,
> Suren.
>

V2 with Minchan's fix is posted at:
https://lore.kernel.org/lkml/2021070622.2613577-1-sur...@google.com/T/#u

> >
> > Thanks,
> > Florian
> > --
> > Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
> > Commercial register: Amtsgericht Muenchen, HRB 153243,
> > Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael 
> > O'Neill
> >


[PATCH v2 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-01-11 Thread Suren Baghdasaryan
process_madvise currently requires ptrace attach capability.
PTRACE_MODE_ATTACH gives one process complete control over another
process. It effectively removes the security boundary between the
two processes (in one direction). Granting ptrace attach capability
even to a system process is considered dangerous since it creates an
attack surface. This severely limits the usage of this API.
The operations process_madvise can perform do not affect the correctness
of the operation of the target process; they only affect where the data
is physically located (and therefore, how fast it can be accessed).
What we want is the ability for one process to influence another process
in order to optimize performance across the entire system while leaving
the security boundary intact.
Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ
and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata
and CAP_SYS_NICE for influencing process performance.

Signed-off-by: Suren Baghdasaryan 
Acked-by: Minchan Kim 
Acked-by: David Rientjes 
---
 mm/madvise.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 6a660858784b..a9bcd16b5d95 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1197,12 +1197,22 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const 
struct iovec __user *, vec,
goto release_task;
}
 
-   mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
+   /* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. */
+   mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
if (IS_ERR_OR_NULL(mm)) {
ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
goto release_task;
}
 
+   /*
+* Require CAP_SYS_NICE for influencing process performance. Note that
+* only non-destructive hints are currently supported.
+*/
+   if (!capable(CAP_SYS_NICE)) {
+   ret = -EPERM;
+   goto release_mm;
+   }
+
total_len = iov_iter_count();
 
while (iov_iter_count()) {
@@ -1217,6 +1227,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct 
iovec __user *, vec,
if (ret == 0)
ret = total_len - iov_iter_count();
 
+release_mm:
mmput(mm);
 release_task:
put_task_struct(task);
-- 
2.30.0.284.gd98b1dd5eaa7-goog



Re: [PATCH 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-01-11 Thread Suren Baghdasaryan
On Mon, Jan 11, 2021 at 2:20 AM Florian Weimer  wrote:
>
> * Suren Baghdasaryan:
>
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 6a660858784b..c2d600386902 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1197,12 +1197,22 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const 
> > struct iovec __user *, vec,
> >   goto release_task;
> >   }
> >
> > - mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> > + /* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. */
> > + mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
> >   if (IS_ERR_OR_NULL(mm)) {
> >   ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> >   goto release_task;
> >   }
>
> Shouldn't this depend on the requested behavior?  Several operations
> directly result in observable changes, and go beyond performance tuning.

Thanks for the comment Florian.
process_madvise supports only MADV_COLD and MADV_PAGEOUT hints which
are both non-destructive (see process_madvise_behavior_valid()
function). Maybe you meant something else by "observable changes", if
so please clarify.
Thanks,
Suren.

>
> Thanks,
> Florian
> --
> Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
> Commercial register: Amtsgericht Muenchen, HRB 153243,
> Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael 
> O'Neill
>


Re: [PATCH 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-01-08 Thread Suren Baghdasaryan
On Fri, Jan 8, 2021 at 5:02 PM David Rientjes  wrote:
>
> On Fri, 8 Jan 2021, Suren Baghdasaryan wrote:
>
> > > > @@ -1197,12 +1197,22 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, 
> > > > const struct iovec __user *, vec,
> > > >   goto release_task;
> > > >   }
> > > >
> > > > - mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> > > > + /* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. */
> > > > + mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
> > > >   if (IS_ERR_OR_NULL(mm)) {
> > > >   ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> > > >   goto release_task;
> > > >   }
> > > >
> > > > + /*
> > > > +  * Require CAP_SYS_NICE for influencing process performance. Note 
> > > > that
> > > > +  * only non-destructive hints are currently supported.
> > > > +  */
> > > > + if (!capable(CAP_SYS_NICE)) {
> > > > + ret = -EPERM;
> > > > + goto release_task;
> > >
> > > mmput?
> >
> > Ouch! Thanks for pointing it out! Will include in the next respin.
> >
>
> With the fix, feel free to add:
>
> Acked-by: David Rientjes 

Thanks! Will post a new version with the fix on Monday.

>
> Thanks Suren!


Re: [PATCH 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-01-08 Thread Suren Baghdasaryan
On Fri, Jan 8, 2021 at 2:15 PM Minchan Kim  wrote:
>
> On Fri, Jan 08, 2021 at 12:58:57PM -0800, Suren Baghdasaryan wrote:
> > process_madvise currently requires ptrace attach capability.
> > PTRACE_MODE_ATTACH gives one process complete control over another
> > process. It effectively removes the security boundary between the
> > two processes (in one direction). Granting ptrace attach capability
> > even to a system process is considered dangerous since it creates an
> > attack surface. This severely limits the usage of this API.
> > The operations process_madvise can perform do not affect the correctness
> > of the operation of the target process; they only affect where the data
> > is physically located (and therefore, how fast it can be accessed).
> > What we want is the ability for one process to influence another process
> > in order to optimize performance across the entire system while leaving
> > the security boundary intact.
> > Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ
> > and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata
> > and CAP_SYS_NICE for influencing process performance.
> >
> > Signed-off-by: Suren Baghdasaryan 
>
> It sounds logical to me.
> If security folks don't see any concern and fix below,
>
> Acked-by: Minchan Kim 
>
> > @@ -1197,12 +1197,22 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const 
> > struct iovec __user *, vec,
> >   goto release_task;
> >   }
> >
> > - mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> > + /* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. */
> > + mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
> >   if (IS_ERR_OR_NULL(mm)) {
> >   ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> >   goto release_task;
> >   }
> >
> > + /*
> > +  * Require CAP_SYS_NICE for influencing process performance. Note that
> > +  * only non-destructive hints are currently supported.
> > +  */
> > + if (!capable(CAP_SYS_NICE)) {
> > + ret = -EPERM;
> > + goto release_task;
>
> mmput?

Ouch! Thanks for pointing it out! Will include in the next respin.

>
> > + }
> > +
> >   total_len = iov_iter_count();
> >
> >   while (iov_iter_count()) {
> > --
> > 2.30.0.284.gd98b1dd5eaa7-goog
> >
>


[PATCH 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-01-08 Thread Suren Baghdasaryan
process_madvise currently requires ptrace attach capability.
PTRACE_MODE_ATTACH gives one process complete control over another
process. It effectively removes the security boundary between the
two processes (in one direction). Granting ptrace attach capability
even to a system process is considered dangerous since it creates an
attack surface. This severely limits the usage of this API.
The operations process_madvise can perform do not affect the correctness
of the operation of the target process; they only affect where the data
is physically located (and therefore, how fast it can be accessed).
What we want is the ability for one process to influence another process
in order to optimize performance across the entire system while leaving
the security boundary intact.
Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ
and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata
and CAP_SYS_NICE for influencing process performance.

Signed-off-by: Suren Baghdasaryan 
---
 mm/madvise.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 6a660858784b..c2d600386902 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1197,12 +1197,22 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const 
struct iovec __user *, vec,
goto release_task;
}
 
-   mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
+   /* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. */
+   mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
if (IS_ERR_OR_NULL(mm)) {
ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
goto release_task;
}
 
+   /*
+* Require CAP_SYS_NICE for influencing process performance. Note that
+* only non-destructive hints are currently supported.
+*/
+   if (!capable(CAP_SYS_NICE)) {
+   ret = -EPERM;
+   goto release_task;
+   }
+
total_len = iov_iter_count();
 
while (iov_iter_count()) {
-- 
2.30.0.284.gd98b1dd5eaa7-goog



Re: [PATCH v5 10/15] sched: Introduce force_compatible_cpus_allowed_ptr() to limit CPU affinity

2020-12-27 Thread Suren Baghdasaryan
Just a couple minor nits.

On Tue, Dec 8, 2020 at 5:29 AM Will Deacon  wrote:
>
> Asymmetric systems may not offer the same level of userspace ISA support
> across all CPUs, meaning that some applications cannot be executed by
> some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
> not feature support for 32-bit applications on both clusters.
>
> Although userspace can carefully manage the affinity masks for such
> tasks, one place where it is particularly problematic is execve()
> because the CPU on which the execve() is occurring may be incompatible
> with the new application image. In such a situation, it is desirable to
> restrict the affinity mask of the task and ensure that the new image is
> entered on a compatible CPU. From userspace's point of view, this looks
> the same as if the incompatible CPUs have been hotplugged off in the
> task's affinity mask.
>
> In preparation for restricting the affinity mask for compat tasks on
> arm64 systems without uniform support for 32-bit applications, introduce
> force_compatible_cpus_allowed_ptr(), which restricts the affinity mask
> for a task to contain only compatible CPUs.
>
> Reviewed-by: Quentin Perret 
> Signed-off-by: Will Deacon 
> ---
>  include/linux/sched.h |   1 +
>  kernel/sched/core.c   | 100 +++---
>  2 files changed, 86 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 76cd21fa5501..e42dd0fb85c5 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1653,6 +1653,7 @@ extern int task_can_attach(struct task_struct *p, const 
> struct cpumask *cs_cpus_
>  #ifdef CONFIG_SMP
>  extern void do_set_cpus_allowed(struct task_struct *p, const struct cpumask 
> *new_mask);
>  extern int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask 
> *new_mask);
> +extern void force_compatible_cpus_allowed_ptr(struct task_struct *p);
>  #else
>  static inline void do_set_cpus_allowed(struct task_struct *p, const struct 
> cpumask *new_mask)
>  {
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 92ac3e53f50a..1cfc94be18a9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1863,25 +1863,19 @@ void do_set_cpus_allowed(struct task_struct *p, const 
> struct cpumask *new_mask)
>  }
>
>  /*
> - * Change a given task's CPU affinity. Migrate the thread to a
> - * proper CPU and schedule it away if the CPU it's executing on
> - * is removed from the allowed bitmask.
> - *
> - * NOTE: the caller must have a valid reference to the task, the
> - * task must not exit() & deallocate itself prematurely. The
> - * call is not atomic; no spinlocks may be held.
> + * Called with both p->pi_lock and rq->lock held; drops both before 
> returning.

Maybe annotate with __releases()?

>   */
> -static int __set_cpus_allowed_ptr(struct task_struct *p,
> - const struct cpumask *new_mask, bool check)
> +static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
> +const struct cpumask *new_mask,
> +bool check,
> +struct rq *rq,
> +struct rq_flags *rf)
>  {
> const struct cpumask *cpu_valid_mask = cpu_active_mask;
> const struct cpumask *cpu_allowed_mask = task_cpu_possible_mask(p);
> unsigned int dest_cpu;
> -   struct rq_flags rf;
> -   struct rq *rq;
> int ret = 0;
>
> -   rq = task_rq_lock(p, );
> update_rq_clock(rq);
>
> if (p->flags & PF_KTHREAD) {
> @@ -1936,7 +1930,7 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
> if (task_running(rq, p) || p->state == TASK_WAKING) {
> struct migration_arg arg = { p, dest_cpu };
> /* Need help from migration thread: drop lock and wait. */
> -   task_rq_unlock(rq, p, );
> +   task_rq_unlock(rq, p, rf);
> stop_one_cpu(cpu_of(rq), migration_cpu_stop, );
> return 0;
> } else if (task_on_rq_queued(p)) {
> @@ -1944,20 +1938,96 @@ static int __set_cpus_allowed_ptr(struct task_struct 
> *p,
>  * OK, since we're going to drop the lock immediately
>  * afterwards anyway.
>  */
> -   rq = move_queued_task(rq, , p, dest_cpu);
> +   rq = move_queued_task(rq, rf, p, dest_cpu);
> }
>  out:
> -   task_rq_unlock(rq, p, );
> +   task_rq_unlock(rq, p, rf);
>
> return ret;
>  }
>
> +/*
> + * Change a given task's CPU affinity. Migrate the thread to a
> + * proper CPU and schedule it away if the CPU it's executing on
> + * is removed from the allowed bitmask.
> + *
> + * NOTE: the caller must have a valid reference to the task, the
> + * task must not exit() & deallocate itself prematurely. The
> + * call is not atomic; no spinlocks may 
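
Returning to the __releases() nit above: a hypothetical sketch of what
the sparse annotations could look like on the new helper (body elided;
per the patch's comment, every path drops both locks via
task_rq_unlock()):

static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
					 const struct cpumask *new_mask,
					 bool check,
					 struct rq *rq,
					 struct rq_flags *rf)
	__releases(rq->lock)
	__releases(p->pi_lock)
{
	/* ... body as in the patch; all return paths drop both locks ... */
}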

Re: [PATCH v5 08/15] cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()

2020-12-27 Thread Suren Baghdasaryan
On Tue, Dec 8, 2020 at 5:29 AM Will Deacon  wrote:
>
> Asymmetric systems may not offer the same level of userspace ISA support
> across all CPUs, meaning that some applications cannot be executed by
> some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
> not feature support for 32-bit applications on both clusters.
>
> Modify guarantee_online_cpus() to take task_cpu_possible_mask() into
> account when trying to find a suitable set of online CPUs for a given
> task. This will avoid passing an invalid mask to set_cpus_allowed_ptr()
> during ->attach() and will subsequently allow the cpuset hierarchy to be
> taken into account when forcefully overriding the affinity mask for a
> task which requires migration to a compatible CPU.
>
> Cc: Li Zefan 
> Cc: Tejun Heo 
> Cc: Johannes Weiner 
> Signed-off-by: Will Deacon 
> ---
>  include/linux/cpuset.h |  3 ++-
>  kernel/cgroup/cpuset.c | 33 +++--
>  2 files changed, 21 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 04c20de66afc..414a8e694413 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -15,6 +15,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>
>  #ifdef CONFIG_CPUSETS
> @@ -184,7 +185,7 @@ static inline void cpuset_read_unlock(void) { }
>  static inline void cpuset_cpus_allowed(struct task_struct *p,
>struct cpumask *mask)
>  {
> -   cpumask_copy(mask, cpu_possible_mask);
> +   cpumask_copy(mask, task_cpu_possible_mask(p));
>  }
>
>  static inline void cpuset_cpus_allowed_fallback(struct task_struct *p)
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index e970737c3ed2..d30febf1f69f 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -372,18 +372,26 @@ static inline bool is_in_v2_mode(void)
>  }
>
>  /*
> - * Return in pmask the portion of a cpusets's cpus_allowed that
> - * are online.  If none are online, walk up the cpuset hierarchy
> - * until we find one that does have some online cpus.
> + * Return in pmask the portion of a task's cpusets's cpus_allowed that
> + * are online and are capable of running the task.  If none are found,
> + * walk up the cpuset hierarchy until we find one that does have some
> + * appropriate cpus.
>   *
>   * One way or another, we guarantee to return some non-empty subset
>   * of cpu_online_mask.
>   *
>   * Call with callback_lock or cpuset_mutex held.
>   */
> -static void guarantee_online_cpus(struct cpuset *cs, struct cpumask *pmask)
> +static void guarantee_online_cpus(struct task_struct *tsk,
> + struct cpumask *pmask)
>  {
> -   while (!cpumask_intersects(cs->effective_cpus, cpu_online_mask)) {
> +   struct cpuset *cs = task_cs(tsk);
> +   const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
> +
> +   if (WARN_ON(!cpumask_and(pmask, possible_mask, cpu_online_mask)))

IIUC, this represents the case when there is no online CPU that can
run this task. In this situation guarantee_online_cpus() will return
an online CPU which can't run the task (because we ignore
possible_mask). I don't think this can be considered a valid fallback
path. However I think patch [13/15] ensures that we never end up in
this situation by disallowing to offline the last 32-bit capable CPU.
If that's true then maybe the patches can be reordered so that [13/15]
comes before this one and this condition can be treated as a bug here?


> +   cpumask_copy(pmask, cpu_online_mask);
> +
> +   while (!cpumask_intersects(cs->effective_cpus, pmask)) {
> cs = parent_cs(cs);
> if (unlikely(!cs)) {
> /*
> @@ -393,11 +401,10 @@ static void guarantee_online_cpus(struct cpuset *cs, 
> struct cpumask *pmask)
>  * cpuset's effective_cpus is on its way to be
>  * identical to cpu_online_mask.
>  */
> -   cpumask_copy(pmask, cpu_online_mask);
> return;
> }
> }
> -   cpumask_and(pmask, cs->effective_cpus, cpu_online_mask);
> +   cpumask_and(pmask, pmask, cs->effective_cpus);
>  }
>
>  /*
> @@ -2176,15 +2183,13 @@ static void cpuset_attach(struct cgroup_taskset *tset)
>
> percpu_down_write(_rwsem);
>
> -   /* prepare for attach */
> -   if (cs == _cpuset)
> -   cpumask_copy(cpus_attach, cpu_possible_mask);
> -   else
> -   guarantee_online_cpus(cs, cpus_attach);
> -
> guarantee_online_mems(cs, _attach_nodemask_to);
>
> cgroup_taskset_for_each(task, css, tset) {
> +   if (cs != _cpuset)
> +   guarantee_online_cpus(task, cpus_attach);
> +   else
> +   cpumask_copy(cpus_attach, 
> task_cpu_possible_mask(task));
> /*
>

Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

2020-12-23 Thread Suren Baghdasaryan
On Tue, Dec 22, 2020 at 11:57 PM Christoph Hellwig  wrote:
>
> On Tue, Dec 22, 2020 at 09:48:43AM -0800, Suren Baghdasaryan wrote:
> > Thanks for the feedback! The use case is userspace memory reaping
> > similar to oom-reaper. Detailed justification is here:
> > https://lore.kernel.org/linux-mm/20201124053943.1684874-1-sur...@google.com
>
> Given that this new variant of process_madvise
>
>   a) does not work on an address range

True, however I can see other madvise flavors that could be used on
the entire process. For example process_madvise(MADV_PAGEOUT) could be
used to "shrink" an entire inactive background process.

>   b) is destructive

I agree that memory reaping might be the only case when a destructive
process_madvise() makes sense. Unless the target process is dying, a
destructive process_madvise() would need coordination with the target
process, and if it's coordinated then the target might as well call
normal madvise() itself.

>   c) doesn't share much code at all with the rest of process_madvise

It actually does reuse a considerable part of the code, but the same
code can be refactored and reused either way.

>
> Why not add a proper separate syscall?

I think my answer to (a) is one justification for allowing
process_madvise() to operate on the entire process. Also MADV_DONTNEED
seems quite suitable for this operation.
Considering the above answers, are you still leaning towards a separate syscall?

>


Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

2020-12-22 Thread Suren Baghdasaryan
On Tue, Dec 22, 2020 at 9:48 AM Suren Baghdasaryan  wrote:
>
> On Tue, Dec 22, 2020 at 5:44 AM Christoph Hellwig  wrote:
> >
> > On Fri, Dec 11, 2020 at 09:27:46PM +0100, Jann Horn wrote:
> > > > Can we just use one element in iovec to indicate entire address rather
> > > > than using up the reserved flags?
> > > >
> > > > struct iovec {
> > > > .iov_base = NULL,
> > > > .iov_len = (~(size_t)0),
> > > > };
> > >
> > > In addition to Suren's objections, I think it's also worth considering
> > > how this looks in terms of compat API. If a compat process does
> > > process_madvise() on another compat process, it would be specifying
> > > the maximum 32-bit number, rather than the maximum 64-bit number, so
> > > you'd need special code to catch that case, which would be ugly.
> > >
> > > And when a compat process uses this API on a non-compat process, it
> > > semantically gets really weird: The actual address range covered would
> > > be larger than the address range specified.
> > >
> > > And if we want different access checks for the two flavors in the
> > > future, gating that different behavior on special values in the iovec
> > > would feel too magical to me.
> > >
> > > And the length value SIZE_MAX doesn't really make sense anyway because
> > > the length of the whole address space would be SIZE_MAX+1, which you
> > > can't express.
> > >
> > > So I'm in favor of a new flag, and strongly against using SIZE_MAX as
> > > a magic number here.
> >
> > Yes, using SIZE_MAX is a horrible interface in this case.  I'm not
> > a huge fan of a flag either.  What is the use case for the madvise
> > to all of a processes address space anyway?
>
> Thanks for the feedback! The use case is userspace memory reaping
> similar to oom-reaper. Detailed justification is here:
> https://lore.kernel.org/linux-mm/20201124053943.1684874-1-sur...@google.com

Actually this post is the most informative and includes test results:
https://lore.kernel.org/linux-api/cajucfpgz1kpm3g1gzh+09z7aowkg05qsammisj7h5mdmrrr...@mail.gmail.com/


Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

2020-12-22 Thread Suren Baghdasaryan
On Tue, Dec 22, 2020 at 5:44 AM Christoph Hellwig  wrote:
>
> On Fri, Dec 11, 2020 at 09:27:46PM +0100, Jann Horn wrote:
> > > Can we just use one element in iovec to indicate entire address rather
> > > than using up the reserved flags?
> > >
> > > struct iovec {
> > > .iov_base = NULL,
> > > .iov_len = (~(size_t)0),
> > > };
> >
> > In addition to Suren's objections, I think it's also worth considering
> > how this looks in terms of compat API. If a compat process does
> > process_madvise() on another compat process, it would be specifying
> > the maximum 32-bit number, rather than the maximum 64-bit number, so
> > you'd need special code to catch that case, which would be ugly.
> >
> > And when a compat process uses this API on a non-compat process, it
> > semantically gets really weird: The actual address range covered would
> > be larger than the address range specified.
> >
> > And if we want different access checks for the two flavors in the
> > future, gating that different behavior on special values in the iovec
> > would feel too magical to me.
> >
> > And the length value SIZE_MAX doesn't really make sense anyway because
> > the length of the whole address space would be SIZE_MAX+1, which you
> > can't express.
> >
> > So I'm in favor of a new flag, and strongly against using SIZE_MAX as
> > a magic number here.
>
> Yes, using SIZE_MAX is a horrible interface in this case.  I'm not
> a huge fan of a flag either.  What is the use case for the madvise
> to all of a processes address space anyway?

Thanks for the feedback! The use case is userspace memory reaping
similar to oom-reaper. Detailed justification is here:
https://lore.kernel.org/linux-mm/20201124053943.1684874-1-sur...@google.com


Re: [RFC][PATCH 1/3] dma-buf: heaps: Add deferred-free-helper library code

2020-12-22 Thread Suren Baghdasaryan
Hi John,
Just a couple nits, otherwise looks sane to me.

On Thu, Dec 17, 2020 at 3:06 PM John Stultz  wrote:
>
> This patch provides infrastructure for deferring buffer frees.
>
> This is a feature ION provided which when used with some form
> of a page pool, provides a nice performance boost in an
> allocation microbenchmark. The reason it helps is it allows the
> page-zeroing to be done out of the normal allocation/free path,
> and pushed off to a kthread.

I suggest adding some more description for this API and how it can be
used. IIUC there are 2 uses: lazy deferred freeing using a kthread, and
object pooling. The no_pool parameter I think deserves some explanation
(it disallows pooling when the system is under memory pressure).

>
> As not all heaps will find this useful, its implemented as
> a optional helper library that heaps can utilize.
>
> Cc: Sumit Semwal 
> Cc: Liam Mark 
> Cc: Chris Goldsworthy 
> Cc: Laura Abbott 
> Cc: Brian Starkey 
> Cc: Hridya Valsaraju 
> Cc: Suren Baghdasaryan 
> Cc: Sandeep Patil 
> Cc: Daniel Mentz 
> Cc: Ørjan Eide 
> Cc: Robin Murphy 
> Cc: Ezequiel Garcia 
> Cc: Simon Ser 
> Cc: James Jones 
> Cc: linux-me...@vger.kernel.org
> Cc: dri-de...@lists.freedesktop.org
> Signed-off-by: John Stultz 
> ---
>  drivers/dma-buf/heaps/Kconfig|   3 +
>  drivers/dma-buf/heaps/Makefile   |   1 +
>  drivers/dma-buf/heaps/deferred-free-helper.c | 136 +++
>  drivers/dma-buf/heaps/deferred-free-helper.h |  15 ++
>  4 files changed, 155 insertions(+)
>  create mode 100644 drivers/dma-buf/heaps/deferred-free-helper.c
>  create mode 100644 drivers/dma-buf/heaps/deferred-free-helper.h
>
> diff --git a/drivers/dma-buf/heaps/Kconfig b/drivers/dma-buf/heaps/Kconfig
> index a5eef06c4226..ecf65204f714 100644
> --- a/drivers/dma-buf/heaps/Kconfig
> +++ b/drivers/dma-buf/heaps/Kconfig
> @@ -1,3 +1,6 @@
> +config DMABUF_HEAPS_DEFERRED_FREE
> +   bool
> +
>  config DMABUF_HEAPS_SYSTEM
> bool "DMA-BUF System Heap"
> depends on DMABUF_HEAPS
> diff --git a/drivers/dma-buf/heaps/Makefile b/drivers/dma-buf/heaps/Makefile
> index 974467791032..4e7839875615 100644
> --- a/drivers/dma-buf/heaps/Makefile
> +++ b/drivers/dma-buf/heaps/Makefile
> @@ -1,3 +1,4 @@
>  # SPDX-License-Identifier: GPL-2.0
> +obj-$(CONFIG_DMABUF_HEAPS_DEFERRED_FREE) += deferred-free-helper.o
>  obj-$(CONFIG_DMABUF_HEAPS_SYSTEM)  += system_heap.o
>  obj-$(CONFIG_DMABUF_HEAPS_CMA) += cma_heap.o
> diff --git a/drivers/dma-buf/heaps/deferred-free-helper.c 
> b/drivers/dma-buf/heaps/deferred-free-helper.c
> new file mode 100644
> index ..b8f54860454f
> --- /dev/null
> +++ b/drivers/dma-buf/heaps/deferred-free-helper.c
> @@ -0,0 +1,136 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Deferred dmabuf freeing helper
> + *
> + * Copyright (C) 2020 Linaro, Ltd.
> + *
> + * Based on the ION page pool code
> + * Copyright (C) 2011 Google, Inc.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "deferred-free-helper.h"
> +
> +static LIST_HEAD(free_list);
> +static size_t list_size;
> +wait_queue_head_t freelist_waitqueue;
> +struct task_struct *freelist_task;
> +static DEFINE_MUTEX(free_list_lock);
> +
> +enum {
> +   USE_POOL = 0,
> +   SKIP_POOL = 1,
> +};

This enum is used for a bool parameter. Either make it part of the
public API or eliminate it and use bool instead.

> +
> +void deferred_free(struct deferred_freelist_item *item,
> +  void (*free)(struct deferred_freelist_item*, bool),
> +  size_t size)
> +{
> +   INIT_LIST_HEAD(>list);
> +   item->size = size;
> +   item->free = free;
> +
> +   mutex_lock(_list_lock);
> +   list_add(>list, _list);
> +   list_size += size;
> +   mutex_unlock(_list_lock);
> +   wake_up(_waitqueue);
> +}
> +
> +static size_t free_one_item(bool nopool)
> +{
> +   size_t size = 0;
> +   struct deferred_freelist_item *item;
> +
> +   mutex_lock(_list_lock);
> +   if (list_empty(_list)) {
> +   mutex_unlock(_list_lock);
> +   return 0;
> +   }
> +   item = list_first_entry(_list, struct deferred_freelist_item, 
> list);
> +   list_del(>list);
> +   size = item->size;
> +   list_size -= size;
> +   mutex_unlock(_list_lock);
> +
> +   item->free(item, nopool);
> +   return size;
> +}
> +
> +static unsigned long get_freelist_size(void)
> +{
> +   unsigned long size;
> +
> +   mutex_lock(_list_lock);
> +  
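
Picking up the enum nit above: a hypothetical sketch of the plain-bool
variant (with the '&' characters that the archive mangled restored);
not the actual respin:

static size_t free_one_item(bool skip_pool)
{
	struct deferred_freelist_item *item;
	size_t size = 0;

	mutex_lock(&free_list_lock);
	if (list_empty(&free_list)) {
		mutex_unlock(&free_list_lock);
		return 0;
	}
	item = list_first_entry(&free_list,
				struct deferred_freelist_item, list);
	list_del(&item->list);
	size = item->size;
	list_size -= size;
	mutex_unlock(&free_list_lock);

	/* true means "under memory pressure": skip pooling, free for real */
	item->free(item, skip_pool);
	return size;
}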

Re: [PATCH v5 00/15] An alternative series for asymmetric AArch32 systems

2020-12-16 Thread Suren Baghdasaryan
On Wed, Dec 16, 2020 at 8:48 AM Qais Yousef  wrote:
>
> On 12/16/20 14:14, Will Deacon wrote:
> > Hi Qais,
> >
> > On Wed, Dec 16, 2020 at 11:16:46AM +, Qais Yousef wrote:
> > > On 12/08/20 13:28, Will Deacon wrote:
> > > > Changes in v5 include:
> > > >
> > > >   * Teach cpuset_cpus_allowed() about task_cpu_possible_mask() so that
> > > > we can avoid returning incompatible CPUs for a given task. This
> > > > means that sched_setaffinity() can be used with larger masks (like
> > > > the online mask) from userspace and also allows us to take into
> > > > account the cpuset hierarchy when forcefully overriding the affinity
> > > > for a task on execve().
> > > >
> > > >   * Honour task_cpu_possible_mask() when attaching a task to a cpuset,
> > > > so that the resulting affinity mask does not contain any 
> > > > incompatible
> > > > CPUs (since it would be rejected by set_cpus_allowed_ptr() 
> > > > otherwise).
> > > >
> > > >   * Moved overriding of the affinity mask into the scheduler core rather
> > > > than munge affinity masks directly in the architecture backend.
> > > >
> > > >   * Extended comments and documentation.
> > > >
> > > >   * Some renaming and cosmetic changes.
> > > >
> > > > I'm pretty happy with this now, although it still needs review and will
> > > > require rebasing to play nicely with the SCA changes in -next.
> > >
> > > I still have concerns about the cpuset v1 handling. Specifically:
> > >
> > > 1. Attaching a 32bit task to a 64bit-only cpuset is allowed.
> > >
> > >I think the right behavior here is to prevent that as the
> > >intersection will appear as offline cpus for the 32bit tasks. So it
> > >shouldn't be allowed to move there.
> >
> > Suren or Quentin can correct me if I'm wrong here, but I think Android
> > relies on this working so it's not an option for us to prevent the attach.
>
> I don't think so. It's just a matter of who handles the error, i.e. the
> kernel fixes it up silently and effectively makes the cpuset a NOP since
> we don't respect the affinity of the cpuset, or user space picks the next
> best thing. Since this could return an error anyway, user space likely
> already handles this.

Moving a 32bit task around the hierarchy when it loses the last 32bit
capable CPU in its affinity mask would not work for Android. We move
the tasks in the hierarchy only when they change their role
(background/foreground/etc) and do not expect the tasks to migrate
by themselves. I think the current approach of adjusting affinity
without migration, while not ideal, is much better. Consistency with
cgroup v2 is a big plus as well.
We do plan on moving the cpuset controller to cgroup v2, but the
transition is slow, so my guess is that we will stick with v1 for
another Android release.

> > I also don't think it really achieves much, since as you point out, the same
> > problem exists in other cases such as execve() of a 32-bit binary, or
> > hotplugging off all 32-bit CPUs within a mixed cpuset. Allowing the attach
> > and immediately reparenting would probably be better, but see below.
>
> I am just wary that we're introducing generic asymmetric ISA support, so my
> concerns have been related to making sure the behavior is sane generally. When
> this gets merged, I can bet more 'fun' hardware will appear all over the place.
> We're opening the flood gates, I'm afraid :p
>
> > > 2. Modifying cpuset.cpus could result with empty set for 32bit tasks.
> > >
> > >    It is a variation of the above; the cpuset just transforms into
> > >    64bit-only after we attach.
> > >
> > >    I think the right behavior here is to move the 32bit tasks to the
> > >    nearest ancestor like we do when all cpuset.cpus are hotplugged out.
> > >
> > >    We could also return an error if the new set would result in an
> > >    empty set for the 32bit tasks, in a similar manner to how it fails
> > >    if you write a cpu that is offline.
> > >
> > > 3. If a 64bit task that belongs to a 64bit-only cpuset execs a 32bit
> > >    binary, the 32bit task will inherit the cgroup setting.
> > >
> > >    Like above, we should move this to the nearest ancestor.
> >
> > I considered this when I was writing the patches, but the reality is that
> > by allowing 32-bit tasks to attach to a 64-bit only cpuset (which is
> > required by Android), we have no choice but to expose a new ABI to
> > userspace. This is all gated behind a command-line option, so I think
> > that's fine, but then why not just have the same behaviour as cgroup v2?
> > I don't see the point in creating two new ABIs (for cgroup v1 and v2
> > respectively) if we don't need to.
>
> Ultimately it's up to Tejun and Peter I guess. I thought we needed to
> preserve the v1 behavior for the new class of tasks. I won't object to
> the new ABI myself. Maybe we just need to make the commit messages and
> cgroup-v1 documentation reflect that explicitly.
>

Re: [PATCH] psi: fix monitor for root cgroup

2020-12-08 Thread Suren Baghdasaryan
On Tue, Dec 8, 2020 at 12:35 AM Odin Ugedal  wrote:
>
> Fix a NULL pointer dereference when adding a new psi monitor to the root
> cgroup. PSI files for the root cgroup were introduced in df5ba5be742 by
> using the system-wide psi struct when reading, but file write/monitor was
> not properly fixed. Since the PSI config for the root cgroup isn't
> initialized, the current implementation tries to lock a NULL ptr,
> resulting in a crash.
>
> Can be triggered by running this as root:
> $ tee /sys/fs/cgroup/cpu.pressure <<< "some 1 100"
>
>
> Signed-off-by: Odin Ugedal 
> ---
>  kernel/cgroup/cgroup.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index e41c21819ba0..5d1fdf7c3ec6 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3567,6 +3567,7 @@ static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
>  {
> struct psi_trigger *new;
> struct cgroup *cgrp;
> +   struct psi_group *psi;
>
> cgrp = cgroup_kn_lock_live(of->kn, false);
> if (!cgrp)
> @@ -3575,7 +3576,8 @@ static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
> cgroup_get(cgrp);
> cgroup_kn_unlock(of->kn);
>
> -   new = psi_trigger_create(&cgrp->psi, buf, nbytes, res);
> +   psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi;
> +   new = psi_trigger_create(psi, buf, nbytes, res);
>     if (IS_ERR(new)) {
> cgroup_put(cgrp);
> return PTR_ERR(new);
> --
> 2.29.2
>

Reviewed-by: Suren Baghdasaryan 
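For reference, a minimal userspace sketch of registering such a monitor on
the root cgroup (per Documentation/accounting/psi.rst; the path assumes
cgroup2 mounted at /sys/fs/cgroup as in the reproducer above, and the
trigger values are illustrative):

	#include <fcntl.h>
	#include <poll.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		/* wake up when "some" CPU stall exceeds 100ms within a 1s window */
		const char trig[] = "some 100000 1000000";
		struct pollfd pfd;
		int fd;

		fd = open("/sys/fs/cgroup/cpu.pressure", O_RDWR | O_NONBLOCK);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* on kernels without the fix above, this write oopses the kernel */
		if (write(fd, trig, strlen(trig) + 1) < 0) {
			perror("write");
			return 1;
		}
		pfd.fd = fd;
		pfd.events = POLLPRI;
		poll(&pfd, 1, -1);	/* blocks until the threshold is breached */
		close(fd);
		return 0;
	}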


Re: [PATCH 2/2] mm/madvise: add process_madvise MADV_DONTNEED support

2020-12-08 Thread Suren Baghdasaryan
On Tue, Dec 8, 2020 at 3:40 PM Jann Horn  wrote:
>
> On Tue, Nov 24, 2020 at 6:50 AM Suren Baghdasaryan  wrote:
> > In modern systems it's not unusual to have a system component monitoring
> > memory conditions of the system and tasked with keeping system memory
> > pressure under control. One way to accomplish that is to kill
> > non-essential processes to free up memory for more important ones.
> > Examples of this are Facebook's OOM killer daemon called oomd and
> > Android's low memory killer daemon called lmkd.
> > For such a system component it's important to be able to free memory
> > quickly and efficiently. Unfortunately the time a process takes to free
> > up its memory after receiving a SIGKILL might vary based on the state
> > of the process (uninterruptible sleep) and the size and OPP level of
> > the core the process is running on.
> > In such a situation it is desirable to be able to free up the memory
> > of the process being killed in a more controlled way.
> > Enable MADV_DONTNEED to be used with process_madvise when applied to a
> > dying process to reclaim its memory. This would allow userspace system
> > components like oomd and lmkd to free memory of the target process in
> > a more predictable way.
> >
> > Signed-off-by: Suren Baghdasaryan 
> [...]
> > @@ -1239,6 +1256,23 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> > goto release_task;
> > }
> >
> > +   if (madvise_destructive(behavior)) {
> > +   /* Allow destructive madvise only on a dying processes */
> > +   if (!signal_group_exit(task->signal)) {
> > +   ret = -EINVAL;
> > +   goto release_mm;
> > +   }
>
> Technically Linux allows processes to share mm_struct without being in
> the same thread group, so I'm not sure whether this check is good
> enough? AFAICS the normal OOM killer deals with this case by letting
> __oom_kill_process() always kill all tasks that share the mm_struct.

Thanks for the comment Jann.
You are right. I think replacing !signal_group_exit(task->signal) with
task_will_free_mem(task) would address both your and Oleg's comments.
IIUC, task_will_free_mem() calls __task_will_free_mem() on the task
itself and on all processes sharing the mm_struct ensuring that they
are all dying.
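Something along these lines, as a sketch only (note that task_will_free_mem()
is currently static in mm/oom_kill.c, so it would have to be made visible to
mm/madvise.c first):

	if (madvise_destructive(behavior)) {
		/* allow destructive madvise only when every user of the
		 * target mm is already exiting */
		if (!task_will_free_mem(task)) {
			ret = -EINVAL;
			goto release_mm;
		}
	}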


Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

2020-12-07 Thread Suren Baghdasaryan
On Mon, Nov 30, 2020 at 11:01 AM Suren Baghdasaryan  wrote:
>
> On Wed, Nov 25, 2020 at 3:43 PM Minchan Kim  wrote:
> >
> > On Wed, Nov 25, 2020 at 03:23:40PM -0800, Suren Baghdasaryan wrote:
> > > On Wed, Nov 25, 2020 at 3:13 PM Minchan Kim  wrote:
> > > >
> > > > On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > > > > process_madvise requires a vector of address ranges to be provided for
> > > > > its operations. When advice should be applied to the entire process,
> > > > > the caller process has to obtain the list of VMAs of the target process
> > > > > by reading /proc/pid/maps or some other way. The cost of this
> > > > > operation grows linearly with the increasing number of VMAs in the
> > > > > target process. Even constructing the input vector can be non-trivial
> > > > > when the target process has several thousands of VMAs and the syscall
> > > > > is being issued during a high memory pressure period when new
> > > > > allocations for such a vector would only worsen the situation.
> > > > > In the case when advice is being applied to the entire memory space of
> > > > > the target process, this creates an extra overhead.
> > > > > Add a PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> > > > > advise a memory range of the target process. For now, to keep it
> > > > > simple, only the entire process memory range is supported; vec and
> > > > > vlen inputs in this mode are ignored and can be NULL and 0.
> > > > > Instead of returning the number of bytes that advice was successfully
> > > > > applied to, the syscall in this mode returns 0 on success. This is due
> > > > > to the fact that the number of bytes would not be useful for the
> > > > > caller, which does not know the amount of memory the call is supposed
> > > > > to affect. Besides, the ssize_t return type can be too small to hold
> > > > > the number of bytes affected when the operation is applied to a large
> > > > > memory range.
> > > >
> > > > Can we just use one element in iovec to indicate the entire address
> > > > space rather than using up the reserved flags?
> > > >
> > > > struct iovec {
> > > > .iov_base = NULL,
> > > > .iov_len = (~(size_t)0),
> > > > };
> > > >
> > > > Furthermore, it could be applied to other syscalls that support
> > > > iovec if we agree on it.
> > > >
> > >
> > > The flag also changes the return value semantics. If we follow your
> > > suggestion we should also agree that in this mode the return value
> > > will be 0 on success and negative otherwise instead of the number of
> > > bytes madvise was applied to.
> >
> > Well, the return value will depend on each API. If the operation is
> > destructive, it should return the right size affected by the API, but
> > 0 or an error would be okay otherwise.
>
> I'm fine with dropping the flag; I just thought with the flag it would
> be more explicit that this is a special mode operating on ranges. This
> way the patch also becomes simpler.
> Andrew, Michal, Christian, what do you think about such API? Should I
> change the API this way / keep the flag / change it in some other way?


Friendly ping to get some feedback on the proposed API please.


Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

2020-11-30 Thread Suren Baghdasaryan
On Wed, Nov 25, 2020 at 3:43 PM Minchan Kim  wrote:
>
> On Wed, Nov 25, 2020 at 03:23:40PM -0800, Suren Baghdasaryan wrote:
> > On Wed, Nov 25, 2020 at 3:13 PM Minchan Kim  wrote:
> > >
> > > On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > > > process_madvise requires a vector of address ranges to be provided for
> > > > its operations. When advice should be applied to the entire process,
> > > > the caller process has to obtain the list of VMAs of the target process
> > > > by reading /proc/pid/maps or some other way. The cost of this
> > > > operation grows linearly with the increasing number of VMAs in the
> > > > target process. Even constructing the input vector can be non-trivial
> > > > when the target process has several thousands of VMAs and the syscall
> > > > is being issued during a high memory pressure period when new
> > > > allocations for such a vector would only worsen the situation.
> > > > In the case when advice is being applied to the entire memory space of
> > > > the target process, this creates an extra overhead.
> > > > Add a PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> > > > advise a memory range of the target process. For now, to keep it
> > > > simple, only the entire process memory range is supported; vec and
> > > > vlen inputs in this mode are ignored and can be NULL and 0.
> > > > Instead of returning the number of bytes that advice was successfully
> > > > applied to, the syscall in this mode returns 0 on success. This is due
> > > > to the fact that the number of bytes would not be useful for the
> > > > caller, which does not know the amount of memory the call is supposed
> > > > to affect. Besides, the ssize_t return type can be too small to hold
> > > > the number of bytes affected when the operation is applied to a large
> > > > memory range.
> > >
> > > Can we just use one element in iovec to indicate the entire address
> > > space rather than using up the reserved flags?
> > >
> > > struct iovec {
> > > .iov_base = NULL,
> > > .iov_len = (~(size_t)0),
> > > };
> > >
> > > Furthermore, it could be applied to other syscalls that support
> > > iovec if we agree on it.
> > >
> >
> > The flag also changes the return value semantics. If we follow your
> > suggestion we should also agree that in this mode the return value
> > will be 0 on success and negative otherwise instead of the number of
> > bytes madvise was applied to.
>
> Well, the return value will depend on each API. If the operation is
> destructive, it should return the right size affected by the API, but
> 0 or an error would be okay otherwise.

I'm fine with dropping the flag; I just thought with the flag it would
be more explicit that this is a special mode operating on ranges. This
way the patch also becomes simpler.
Andrew, Michal, Christian, what do you think about such API? Should I
change the API this way / keep the flag / change it in some other way?


Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range

2020-11-25 Thread Suren Baghdasaryan
On Wed, Nov 25, 2020 at 3:13 PM Minchan Kim  wrote:
>
> On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > process_madvise requires a vector of address ranges to be provided for
> > its operations. When advice should be applied to the entire process,
> > the caller process has to obtain the list of VMAs of the target process
> > by reading /proc/pid/maps or some other way. The cost of this
> > operation grows linearly with the increasing number of VMAs in the
> > target process. Even constructing the input vector can be non-trivial
> > when the target process has several thousands of VMAs and the syscall
> > is being issued during a high memory pressure period when new
> > allocations for such a vector would only worsen the situation.
> > In the case when advice is being applied to the entire memory space of
> > the target process, this creates an extra overhead.
> > Add a PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> > advise a memory range of the target process. For now, to keep it
> > simple, only the entire process memory range is supported; vec and
> > vlen inputs in this mode are ignored and can be NULL and 0.
> > Instead of returning the number of bytes that advice was successfully
> > applied to, the syscall in this mode returns 0 on success. This is due
> > to the fact that the number of bytes would not be useful for the
> > caller, which does not know the amount of memory the call is supposed
> > to affect. Besides, the ssize_t return type can be too small to hold
> > the number of bytes affected when the operation is applied to a large
> > memory range.
>
> Can we just use one element in iovec to indicate the entire address
> space rather than using up the reserved flags?
>
> struct iovec {
> .iov_base = NULL,
> .iov_len = (~(size_t)0),
> };
>
> Furthermore, it could be applied to other syscalls that support iovec
> if we agree on it.
>

The flag also changes the return value semantics. If we follow your
suggestion we should also agree that in this mode the return value
will be 0 on success and negative otherwise instead of the number of
bytes madvise was applied to.

> >
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  arch/alpha/include/uapi/asm/mman.h           |  4 ++
> >  arch/mips/include/uapi/asm/mman.h            |  4 ++
> >  arch/parisc/include/uapi/asm/mman.h          |  4 ++
> >  arch/xtensa/include/uapi/asm/mman.h          |  4 ++
> >  fs/io_uring.c                                |  2 +-
> >  include/linux/mm.h                           |  3 +-
> >  include/uapi/asm-generic/mman-common.h       |  4 ++
> >  mm/madvise.c                                 | 47 +---
> >  tools/include/uapi/asm-generic/mman-common.h |  4 ++
> >  9 files changed, 67 insertions(+), 9 deletions(-)
> >
> > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> > index a18ec7f63888..54588d2f5406 100644
> > --- a/arch/alpha/include/uapi/asm/mman.h
> > +++ b/arch/alpha/include/uapi/asm/mman.h
> > @@ -79,4 +79,8 @@
> >  #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
> >PKEY_DISABLE_WRITE)
> >
> > +/* process_madvise flags */
> > +#define PMADV_FLAG_RANGE 0x1 /* advice for all VMAs in the range */
> > +#define PMADV_FLAG_MASK  (PMADV_FLAG_RANGE)
> > +
> >  #endif /* __ALPHA_MMAN_H__ */
> > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> > index 57dc2ac4f8bd..af94f38a3a9d 100644
> > --- a/arch/mips/include/uapi/asm/mman.h
> > +++ b/arch/mips/include/uapi/asm/mman.h
> > @@ -106,4 +106,8 @@
> >  #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
> >PKEY_DISABLE_WRITE)
> >
> > +/* process_madvise flags */
> > +#define PMADV_FLAG_RANGE 0x1 /* advice for all VMAs in the range */
> > +#define PMADV_FLAG_MASK  (PMADV_FLAG_RANGE)
> > +
> >  #endif /* _ASM_MMAN_H */
> > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> > index ab78cba446ed..ae644c493991 100644
> > --- a/arch/parisc/include/uapi/asm/mman.h
> > +++ b/arch/parisc/include/uapi/asm/mman.h
> > @@ -77,4 +77,8 @@
> >  #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
> >PKEY_DISABLE_WRITE)
> >
> > +/* process_madvise flags */
> > +#define PMADV_FLAG_RANGE 0x1 /* advice for all VMAs in the range */
> > +#define PMADV_FL

Re: [PATCH 2/2] mm/madvise: add process_madvise MADV_DONTNEED support

2020-11-24 Thread Suren Baghdasaryan
On Tue, Nov 24, 2020 at 5:42 AM Oleg Nesterov  wrote:
>
> On 11/23, Suren Baghdasaryan wrote:
> >
> > + if (madvise_destructive(behavior)) {
> > + /* Allow destructive madvise only on a dying processes */
> > + if (!signal_group_exit(task->signal)) {
>
> signal_group_exit(task->signal) is also true if this task execs and
> kills other threads; see the comment above this helper.
>
> I think you need !(task->signal->flags & SIGNAL_GROUP_EXIT).

I see. Thanks for the feedback, Oleg. I'll test and fix it in the next version.

>
> Oleg.
>


Re: [PATCH 1/1] RFC: add pidfd_send_signal flag to reclaim mm while killing a process

2020-11-23 Thread Suren Baghdasaryan
On Wed, Nov 18, 2020 at 4:13 PM Suren Baghdasaryan  wrote:
>
> On Wed, Nov 18, 2020 at 11:55 AM Suren Baghdasaryan  wrote:
> >
> > On Wed, Nov 18, 2020 at 11:51 AM Suren Baghdasaryan  
> > wrote:
> > >
> > > On Wed, Nov 18, 2020 at 11:32 AM Michal Hocko  wrote:
> > > >
> > > > On Wed 18-11-20 11:22:21, Suren Baghdasaryan wrote:
> > > > > On Wed, Nov 18, 2020 at 11:10 AM Michal Hocko  wrote:
> > > > > >
> > > > > > On Fri 13-11-20 18:16:32, Andrew Morton wrote:
> > > > > > [...]
> > > > > > > It's all sounding a bit painful (but not *too* painful).  But to
> > > > > > > reiterate, I do think that adding the ability for a process to 
> > > > > > > shoot
> > > > > > > down a large amount of another process's memory is a lot more 
> > > > > > > generally
> > > > > > > useful than tying it to SIGKILL, agree?
>
> I was looking into how to work around the limitation of MAX_RW_COUNT,
> and the conceptual issue there is "struct iovec", which has its
> iov_len as a size_t and thus lacks the capacity for expressing ranges
> like "entire process memory". I would like to check your reaction to
> the following idea, which can be implemented without painful surgeries
> to import_iovec and its friends.
>
> process_madvise(pidfd, iovec = [ { range_start_addr, 0 }, {
> range_end_addr, 0 } ], vlen = 2, behavior=MADV_xxx, flags =
> PMADV_FLAG_RANGE)
>
> So, to represent a range we pass a new PMADV_FLAG_RANGE flag and
> construct a 2-element vector expressing the range start and range end
> in the iov_base members of the iovec elements; the iov_len members are
> ignored in this mode. I know it sounds hacky but I think it's the
> simplest way if we want the ability to express an arbitrarily large
> range.
> Another option is to do what Andrew described as "madvise((void *)0,
> (void *)-1, MADV_PAGEOUT)" which means this mode works only with the
> entire mm of the process.
> WDYT?
>

To follow up on this discussion, I posted a patchset to implement
process_madvise(MADV_DONTNEED) supporting the entire mm range at
https://lkml.org/lkml/2020/11/24/21.
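As a userspace illustration of that encoding, something like the sketch
below (PMADV_FLAG_RANGE is only a proposed value from that patchset, and
__NR_process_madvise requires recent kernel headers):

	#include <sys/syscall.h>
	#include <sys/uio.h>
	#include <unistd.h>

	#define PMADV_FLAG_RANGE 0x1	/* value from the proposed patchset */

	static long pmadvise_range(int pidfd, void *start, void *end, int behavior)
	{
		/* iov_len is ignored in this mode; the two iov_base fields
		 * carry the start and end of the range */
		struct iovec range[2] = {
			{ .iov_base = start, .iov_len = 0 },
			{ .iov_base = end,   .iov_len = 0 },
		};
		return syscall(__NR_process_madvise, pidfd, range, 2,
			       behavior, PMADV_FLAG_RANGE);
	}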

> > > > > >
> > > > > > I am not sure TBH. Is there any reasonable usecase where
> > > > > > uncoordinated memory tear down is OK while the target process is
> > > > > > able to see the unmapped memory?
> > > > >
> > > > > I think uncoordinated memory tear down is a special case which makes
> > > > > sense only when the target process is being killed (and we can enforce
> > > > > that by allowing MADV_DONTNEED to be used only if the target process
> > > > > has pending SIGKILL).
> > > >
> > > > That would be safe but then I am wondering whether it makes sense to
> > > > implement as a madvise call. It is quite strange to expect somebody to call
> > > > a syscall on a killed process. But this is more a detail. I am not a
> > > > great fan of a more generic MADV_DONTNEED on a remote process. This is
> > > > just too dangerous IMHO.
> > >
> > > Agree 100%
> >
> > I assumed here that by "a more generic MADV_DONTNEED on a remote
> > process" you meant "process_madvise(MADV_DONTNEED) applied to a
> > process that is not being killed". Re-reading your comment I realized
> > that you might have meant "process_madvise() with generic support for
> > large memory areas". I hope I understood you correctly.
> >
> > >
> > > >
> > > > > However, the ability to apply other flavors of
> > > > > process_madvise() to large memory areas spanning multiple VMAs can be
> > > > > useful in more cases.
> > > >
> > > > Yes I do agree with that. The error reporting would be more tricky but
> > > > I am not really sure that the exact reporting is really necessary for
> > > > an advice-like interface.
> > >
> > > Andrew's suggestion for this special mode to change return semantics
> > > to the usual "0 or error code" seems to me like the most reasonable
> > > way to deal with the return value limitation.
> > >
> > > >
> > > > > For example in Android we will use
> > > > > process_madvise(MADV_PAGEOUT) to "shrink" an inactive background
> > > > > process.
> > > >
> > > > That makes sense to me.
> > > > --
> > > > Michal Hocko
> > > > SUSE Labs


[PATCH 2/2] mm/madvise: add process_madvise MADV_DONTNEED support

2020-11-23 Thread Suren Baghdasaryan
In modern systems it's not unusual to have a system component monitoring
memory conditions of the system and tasked with keeping system memory
pressure under control. One way to accomplish that is to kill
non-essential processes to free up memory for more important ones.
Examples of this are Facebook's OOM killer daemon called oomd and
Android's low memory killer daemon called lmkd.
For such a system component it's important to be able to free memory
quickly and efficiently. Unfortunately the time a process takes to free
up its memory after receiving a SIGKILL might vary based on the state
of the process (uninterruptible sleep) and the size and OPP level of
the core the process is running on.
In such a situation it is desirable to be able to free up the memory
of the process being killed in a more controlled way.
Enable MADV_DONTNEED to be used with process_madvise when applied to a
dying process to reclaim its memory. This would allow userspace system
components like oomd and lmkd to free memory of the target process in
a more predictable way.

Signed-off-by: Suren Baghdasaryan 
---
 mm/madvise.c | 34 ++
 1 file changed, 34 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index 1aa074a46524..11306534369e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -995,6 +996,18 @@ process_madvise_behavior_valid(int behavior)
switch (behavior) {
case MADV_COLD:
case MADV_PAGEOUT:
+   case MADV_DONTNEED:
+   return true;
+   default:
+   return false;
+   }
+}
+
+static bool madvise_destructive(int behavior)
+{
+   switch (behavior) {
+   case MADV_DONTNEED:
+   case MADV_FREE:
return true;
default:
return false;
@@ -1006,6 +1019,10 @@ static bool can_range_madv_lru_vma(struct vm_area_struct *vma, int behavior)
if (!can_madv_lru_vma(vma))
return false;
 
+   /* For destructive madvise skip shared file-backed VMAs */
+   if (madvise_destructive(behavior))
+   return vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED);
+
return true;
 }
 
@@ -1239,6 +1256,23 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
goto release_task;
}
 
+   if (madvise_destructive(behavior)) {
+   /* Allow destructive madvise only on a dying processes */
+   if (!signal_group_exit(task->signal)) {
+   ret = -EINVAL;
+   goto release_mm;
+   }
+   /* Ensure no competition with OOM-killer to avoid contention */
+   if (unlikely(mm_is_oom_victim(mm)) ||
+   unlikely(test_bit(MMF_OOM_SKIP, &mm->flags))) {
+   /* Already being reclaimed */
+   ret = 0;
+   goto release_mm;
+   }
+   /* Mark mm as unstable */
+   set_bit(MMF_UNSTABLE, &mm->flags);
+   }
+
/*
 * For range madvise only the entire address space is supported for now
 * and input iovec is ignored.
-- 
2.29.2.454.gaff20da3a2-goog
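
As a usage sketch for a killer daemon like lmkd or oomd (illustrative only;
it assumes the whole-address-space PMADV_FLAG_RANGE mode from the companion
PATCH 1/2 and kernel headers that define the syscall numbers):

	#include <signal.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#define PMADV_FLAG_RANGE 0x1	/* from the companion PATCH 1/2 */

	/* kill the target via its pidfd, then eagerly reclaim its memory */
	static int kill_and_reclaim(int pidfd)
	{
		if (syscall(__NR_pidfd_send_signal, pidfd, SIGKILL, NULL, 0))
			return -1;
		/* vec and vlen are ignored in the whole-address-space mode */
		return syscall(__NR_process_madvise, pidfd, NULL, 0,
			       MADV_DONTNEED, PMADV_FLAG_RANGE) ? -1 : 0;
	}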


