Re: [PATCH v2 2/7] arm64: mm: accelerate pagefault when VM_FAULT_BADACCESS

2024-04-09 Thread Catalin Marinas
On Wed, Apr 03, 2024 at 04:38:00PM +0800, Kefeng Wang wrote:
> The vma's vm_flags have already been checked under the per-VMA lock; if it is
> a bad access, directly set fault to VM_FAULT_BADACCESS and handle the error,
> with no need to retry under mmap_lock. This reduces the latency of lmbench's
> 'lat_sig -P 1 prot lat_sig' test by 34%.
> 
> Since the page fault is handled under the per-VMA lock, count it as a vma
> lock event with VMA_LOCK_SUCCESS.
> 
> Reviewed-by: Suren Baghdasaryan 
> Signed-off-by: Kefeng Wang 

Reviewed-by: Catalin Marinas 


Re: [PATCH v2 1/7] arm64: mm: cleanup __do_page_fault()

2024-04-09 Thread Catalin Marinas
On Wed, Apr 03, 2024 at 04:37:59PM +0800, Kefeng Wang wrote:
> __do_page_fault() only calls handle_mm_fault() after vm_flags has been
> checked, and it is only called by do_page_fault(), so let's squash it into
> do_page_fault() to clean up the code.
> 
> Reviewed-by: Suren Baghdasaryan 
> Signed-off-by: Kefeng Wang 

As I reviewed v1 and the changes are minimal:

Reviewed-by: Catalin Marinas 


Re: [PATCH 2/7] arm64: mm: accelerate pagefault when VM_FAULT_BADACCESS

2024-04-03 Thread Catalin Marinas
On Tue, Apr 02, 2024 at 03:51:37PM +0800, Kefeng Wang wrote:
> The vma's vm_flags have already been checked under the per-VMA lock; if it is
> a bad access, directly set fault to VM_FAULT_BADACCESS and handle the error,
> with no need to lock_mm_and_find_vma() and check vm_flags again. This reduces
> the latency of lmbench's 'lat_sig -P 1 prot lat_sig' by 34%.
> 
> Signed-off-by: Kefeng Wang 
> ---
>  arch/arm64/mm/fault.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 9bb9f395351a..405f9aa831bd 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -572,7 +572,9 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>  
>   if (!(vma->vm_flags & vm_flags)) {
>   vma_end_read(vma);
> - goto lock_mmap;
> + fault = VM_FAULT_BADACCESS;
> + count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
> + goto done;
>   }
>   fault = handle_mm_fault(vma, addr, mm_flags | FAULT_FLAG_VMA_LOCK, 
> regs);
>   if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))

I think this makes sense. A concurrent modification of vma->vm_flags
(e.g. mprotect()) would do a vma_start_write(), so there is no need to
recheck with the mmap lock held.
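
For reference, the fast path after this change would look roughly like this (a
sketch assembled from the hunk above, not the exact file contents):

	if (!(vma->vm_flags & vm_flags)) {
		/*
		 * Permission fault that retrying under mmap_lock cannot fix:
		 * report VM_FAULT_BADACCESS directly and account the fault as
		 * handled under the per-VMA lock.
		 */
		vma_end_read(vma);
		fault = VM_FAULT_BADACCESS;
		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
		goto done;
	}
	fault = handle_mm_fault(vma, addr, mm_flags | FAULT_FLAG_VMA_LOCK, regs);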

Reviewed-by: Catalin Marinas 


Re: [PATCH 1/7] arm64: mm: cleanup __do_page_fault()

2024-04-03 Thread Catalin Marinas
On Tue, Apr 02, 2024 at 03:51:36PM +0800, Kefeng Wang wrote:
> __do_page_fault() only checks vma->vm_flags and calls handle_mm_fault(), and
> it is only called by do_page_fault(), so let's squash it into do_page_fault()
> to clean up the code.
> 
> Signed-off-by: Kefeng Wang 

Reviewed-by: Catalin Marinas 


Re: [PATCH 4/4] vdso: avoid including asm/page.h

2024-02-27 Thread Catalin Marinas
On Mon, Feb 26, 2024 at 05:14:14PM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> The recent change to the vdso_data_store broke building compat VDSO
> on at least arm64 because it includes headers outside of the include/vdso/
> namespace:
> 
> In file included from arch/arm64/include/asm/lse.h:5,
>  from arch/arm64/include/asm/cmpxchg.h:14,
>  from arch/arm64/include/asm/atomic.h:16,
>  from include/linux/atomic.h:7,
>  from include/asm-generic/bitops/atomic.h:5,
>  from arch/arm64/include/asm/bitops.h:25,
>  from include/linux/bitops.h:68,
>  from arch/arm64/include/asm/memory.h:209,
>  from arch/arm64/include/asm/page.h:46,
>  from include/vdso/datapage.h:22,
>  from lib/vdso/gettimeofday.c:5,
>  from :
> arch/arm64/include/asm/atomic_ll_sc.h:298:9: error: unknown type name 'u128'
>   298 | u128 full;
> 
> Use an open-coded page size calculation based on the new CONFIG_PAGE_SHIFT
> Kconfig symbol instead.
> 
> Reported-by: Linux Kernel Functional Testing 
> Fixes: a0d2fcd62ac2 ("vdso/ARM: Make union vdso_data_store available for all architectures")
> Link: https://lore.kernel.org/lkml/ca+g9fytrxxm_ko9fnpz3xarxhv7ud_yqp-teupqrnrhu+_0...@mail.gmail.com/
> Signed-off-by: Arnd Bergmann 

Acked-by: Catalin Marinas 


Re: [PATCH 2/4] arch: simplify architecture specific page size configuration

2024-02-27 Thread Catalin Marinas
On Mon, Feb 26, 2024 at 05:14:12PM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> arc, arm64, parisc and powerpc all have their own Kconfig symbols in place of
> the common CONFIG_PAGE_SIZE_4KB symbols. Change these so the common symbols
> are the ones that are actually used, while leaving the architecture-specific
> ones as the user-visible place for configuring it, to avoid breaking user
> configs.
> 
> Signed-off-by: Arnd Bergmann 

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-19 Thread Catalin Marinas
On Fri, Feb 16, 2024 at 12:53:43PM +, Ryan Roberts wrote:
> On 16/02/2024 12:25, Catalin Marinas wrote:
> > On Thu, Feb 15, 2024 at 10:31:59AM +, Ryan Roberts wrote:
> >> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
> >> +{
> >> +  /*
> >> +   * Gather access/dirty bits, which may be populated in any of the ptes
> >> +   * of the contig range. We may not be holding the PTL, so any contiguous
> >> +   * range may be unfolded/modified/refolded under our feet. Therefore we
> >> +   * ensure we read a _consistent_ contpte range by checking that all ptes
> >> +   * in the range are valid and have CONT_PTE set, that all pfns are
> >> +   * contiguous and that all pgprots are the same (ignoring access/dirty).
> >> +   * If we find a pte that is not consistent, then we must be racing with
> >> +   * an update so start again. If the target pte does not have CONT_PTE
> >> +   * set then that is considered consistent on its own because it is not
> >> +   * part of a contpte range.
> >> +*/
[...]
> > After writing the comments above, I think I figured out that the whole
> > point of this loop is to check that the ptes in the contig range are
> > still consistent and the only variation allowed is the dirty/young
> > state to be passed to the orig_pte returned. The original pte may have
> > been updated by the time this loop finishes but I don't think it
> > matters, it wouldn't be any different than reading a single pte and
> > returning it while it is being updated.
> 
> Correct. The pte can be updated at any time, before, after or during the
> reads. That was always the case. But now we have to cope with a whole contpte
> block being repainted while we are reading it. So we are just checking to make
> sure that all the ptes that we read from the contpte block are consistent with
> each other and therefore we can trust that the access/dirty bits we gathered
> are consistent.

I've been thinking a bit more about this - do any of the callers of
ptep_get_lockless() check the dirty/access bits? The only one that seems
to care is ptdump but in that case I'd rather see the raw bits for
debugging rather than propagating the dirty/access bits to the rest of
the contig range.

So with some clearer documentation on the requirements, I think we don't
need an arm64-specific ptep_get_lockless() (unless I missed something).

-- 
Catalin


Re: [PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-16 Thread Catalin Marinas
On Fri, Feb 16, 2024 at 12:53:43PM +, Ryan Roberts wrote:
> On 16/02/2024 12:25, Catalin Marinas wrote:
> > On Thu, Feb 15, 2024 at 10:31:59AM +, Ryan Roberts wrote:
> >>  arch/arm64/mm/contpte.c  | 285 +++
> > 
> > Nitpick: I think most symbols in contpte.c can be EXPORT_SYMBOL_GPL().
> > We don't expect them to be used by random out of tree modules. In fact,
> > do we expect them to end up in modules at all? Most seem to be called
> > from the core mm code.
> 
> The problem is that the contpte_* symbols are called from the ptep_* inline
> functions. So where those inlines are called from modules, we need to make
> sure the contpte_* symbols are available.
> 
> John Hubbard originally reported this problem against v1 and I enumerated all
> the drivers that call into the ptep_* inlines here:
> https://lore.kernel.org/linux-arm-kernel/b994ff89-1a1f-26ca-9479-b08c77f94...@arm.com/#t
> 
> So they definitely need to be exported. Perhaps we can tighten it to
> EXPORT_SYMBOL_GPL(), but I was being cautious as I didn't want to break
> anything out-of-tree. I'm not sure what the normal policy is? arm64 seems to
> use ~equal amounts of both.

I don't think we are consistent here. For example, set_pte_at() can't be
called from non-GPL modules because of __sync_icache_dcache. OTOH, such a
driver is probably doing something dodgy. The same goes for
apply_to_page_range(): it's GPL-only (called from i915).

Let's see if others have any view over the next week or so, otherwise
I'd go for _GPL and relax it later if someone has a good use-case (can
be a patch on top adding _GPL).

> > If you can make this easier to parse (in a few years time) with an
> > additional patch adding some more comments, that would be great. For
> > this patch:
> 
> I already have a big block comment at the top, which was trying to explain it.
> Clearly not well enough though. I'll add more comments as a follow-up patch
> when I get back from holiday.

I read that comment but it wasn't immediately obvious what the atomicity
requirements are - basically we require a single PTE to be atomically
read (which it is), the rest is the dirty/young state being added on
top. I guess a sentence along these lines would do.
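
Something along these lines at the top of contpte_ptep_get_lockless() would do
(just a suggested wording):

	/*
	 * Atomicity requirement: only a single pte must be read atomically
	 * (__ptep_get() is a READ_ONCE()). The other ptes in the contpte
	 * block are read purely to transfer their access/dirty bits onto
	 * the returned pte, and the whole range is re-checked for
	 * consistency (and re-read if it changed under us).
	 */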

Enjoy your holiday!

-- 
Catalin


Re: [PATCH v6 18/18] arm64/mm: Automatically fold contpte mappings

2024-02-16 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:32:05AM +, Ryan Roberts wrote:
> There are situations where a change to a single PTE could cause the
> contpte block in which it resides to become foldable (i.e. could be
> repainted with the contiguous bit). Such situations arise, for example,
> when user space temporarily changes protections, via mprotect, for
> individual pages, as can be the case for certain garbage collectors.
> 
> We would like to detect when such a PTE change occurs. However this can
> be expensive due to the amount of checking required. Therefore only
> perform the checks when an individual PTE is modified via mprotect
> (ptep_modify_prot_commit() -> set_pte_at() -> set_ptes(nr=1)) and only
> when we are setting the final PTE in a contpte-aligned block.
> 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: [PATCH v6 17/18] arm64/mm: __always_inline to improve fork() perf

2024-02-16 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:32:04AM +, Ryan Roberts wrote:
> As set_ptes() and wrprotect_ptes() become a bit more complex, the
> compiler may choose not to inline them. But this is critical for fork()
> performance. So mark the functions, along with contpte_try_unfold()
> which is called by them, as __always_inline. This is worth ~1% on the
> fork() microbenchmark with order-0 folios (the common case).
> 
> Acked-by: Mark Rutland 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: [PATCH v6 16/18] arm64/mm: Implement pte_batch_hint()

2024-02-16 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:32:03AM +, Ryan Roberts wrote:
> When core code iterates over a range of ptes and calls ptep_get() for
> each of them, if the range happens to cover contpte mappings, the number
> of pte reads becomes amplified by a factor of the number of PTEs in a
> contpte block. This is because for each call to ptep_get(), the
> implementation must read all of the ptes in the contpte block to which
> it belongs to gather the access and dirty bits.
> 
> This causes a hotspot for fork(), as well as operations that unmap
> memory such as munmap(), exit() and madvise(MADV_DONTNEED). Fortunately we
> can fix this by implementing pte_batch_hint(), which allows the core-mm
> iterators to skip reading the contpte tail ptes when gathering the batch
> of ptes to operate on. This brings the number of PTE reads back down
> to 1 per pte.
> 
> Acked-by: Mark Rutland 
> Reviewed-by: David Hildenbrand 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 
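
For illustration, a core-mm style loop could consume the pte_batch_hint()
described above roughly as below (a simplified sketch, not the actual core-mm
code):

	/* Simplified sketch: walk [start, end), skipping contpte tail entries. */
	unsigned long addr;
	unsigned int nr;

	for (addr = start; addr < end; addr += nr * PAGE_SIZE, ptep += nr) {
		pte_t pte = ptep_get(ptep);

		nr = pte_batch_hint(ptep, pte);	/* 1, or CONT_PTES on arm64 */
		/* operate on [addr, addr + nr * PAGE_SIZE) as one batch */
	}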

Acked-by: Catalin Marinas 


Re: [PATCH v6 14/18] arm64/mm: Implement new [get_and_]clear_full_ptes() batch APIs

2024-02-16 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:32:01AM +, Ryan Roberts wrote:
> Optimize the contpte implementation to fix some of the
> exit/munmap/dontneed performance regression introduced by the initial
> contpte commit. Subsequent patches will solve it entirely.
> 
> During exit(), munmap() or madvise(MADV_DONTNEED), mappings must be
> cleared. Previously this was done 1 PTE at a time. But the core-mm
> supports batched clear via the new [get_and_]clear_full_ptes() APIs. So
> let's implement those APIs; for fully covered contpte mappings, we then no
> longer need to unfold the contpte. This significantly reduces unfolding
> operations, reducing the number of TLBIs that must be issued.
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: [PATCH v6 13/18] arm64/mm: Implement new wrprotect_ptes() batch API

2024-02-16 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:32:00AM +, Ryan Roberts wrote:
> Optimize the contpte implementation to fix some of the fork performance
> regression introduced by the initial contpte commit. Subsequent patches
> will solve it entirely.
> 
> During fork(), any private memory in the parent must be write-protected.
> Previously this was done 1 PTE at a time. But the core-mm supports
> batched wrprotect via the new wrprotect_ptes() API. So let's implement
> that API; for fully covered contpte mappings, we then no longer need to
> unfold the contpte. This has 2 benefits:
> 
>   - Reduced unfolding reduces the number of TLBIs that must be issued.
>   - The memory remains contpte-mapped ("folded") in the parent, so it
> continues to benefit from the more efficient use of the TLB after
> the fork.
> 
> The optimization to wrprotect a whole contpte block without unfolding is
> possible thanks to the tightening of the Arm ARM with respect to the
> definition and behaviour when 'Misprogramming the Contiguous bit'. See
> section D21194 at https://developer.arm.com/documentation/102105/ja-07/
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 
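
For context, the core-mm generic fallback for the wrprotect_ptes() API
described above is (roughly, from memory, so not necessarily the exact
upstream code) a per-PTE loop; the arm64 implementation can now avoid the
unfold entirely for fully covered contpte blocks:

	static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
					  pte_t *ptep, unsigned int nr)
	{
		/* Generic fallback: write-protect nr ptes one at a time. */
		for (;;) {
			ptep_set_wrprotect(mm, addr, ptep);
			if (--nr == 0)
				break;
			ptep++;
			addr += PAGE_SIZE;
		}
	}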

Acked-by: Catalin Marinas 


Re: [PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-16 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:31:59AM +, Ryan Roberts wrote:
>  arch/arm64/mm/contpte.c  | 285 +++

Nitpick: I think most symbols in contpte.c can be EXPORT_SYMBOL_GPL().
We don't expect them to be used by random out of tree modules. In fact,
do we expect them to end up in modules at all? Most seem to be called
from the core mm code.

> +#define ptep_get_lockless ptep_get_lockless
> +static inline pte_t ptep_get_lockless(pte_t *ptep)
> +{
> + pte_t pte = __ptep_get(ptep);
> +
> + if (likely(!pte_valid_cont(pte)))
> + return pte;
> +
> + return contpte_ptep_get_lockless(ptep);
> +}
[...]
> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
> +{
> + /*
> +  * Gather access/dirty bits, which may be populated in any of the ptes
> +  * of the contig range. We may not be holding the PTL, so any contiguous
> +  * range may be unfolded/modified/refolded under our feet. Therefore we
> +  * ensure we read a _consistent_ contpte range by checking that all ptes
> +  * in the range are valid and have CONT_PTE set, that all pfns are
> +  * contiguous and that all pgprots are the same (ignoring access/dirty).
> +  * If we find a pte that is not consistent, then we must be racing with
> +  * an update so start again. If the target pte does not have CONT_PTE
> +  * set then that is considered consistent on its own because it is not
> +  * part of a contpte range.
> +*/

I can't get my head around this lockless API. Maybe it works fine (and
may have been discussed already) but we should document what the races
are, why it works, what the memory ordering requirements are. For
example, the generic (well, x86 PAE) ptep_get_lockless() only needs to
ensure that the low/high 32 bits of a pte are consistent and there are
some ordering rules on how these are updated.

Does the arm64 implementation only need to be correct w.r.t. the
access/dirty bits? Since we can read orig_ptep atomically, I assume the
only other updates from unfolding would set the dirty/access bits.

> +
> + pgprot_t orig_prot;
> + unsigned long pfn;
> + pte_t orig_pte;
> + pgprot_t prot;
> + pte_t *ptep;
> + pte_t pte;
> + int i;
> +
> +retry:
> + orig_pte = __ptep_get(orig_ptep);
> +
> + if (!pte_valid_cont(orig_pte))
> + return orig_pte;
> +
> + orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
> + ptep = contpte_align_down(orig_ptep);
> + pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> + pte = __ptep_get(ptep);
> + prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));

We don't have any ordering guarantees in how the ptes in this range are
read or written in the contpte_set_ptes() and the fold/unfold functions.
We might not need them given all the other checks below but it's worth
adding a comment.

> +
> + if (!pte_valid_cont(pte) ||
> +pte_pfn(pte) != pfn ||
> +pgprot_val(prot) != pgprot_val(orig_prot))
> + goto retry;

I think this also needs some comment. I get the !pte_valid_cont() check
to attempt retrying when racing with unfolding. Are the other checks
needed to detect re-folding with different protection or pfn?

> +
> + if (pte_dirty(pte))
> + orig_pte = pte_mkdirty(orig_pte);
> +
> + if (pte_young(pte))
> + orig_pte = pte_mkyoung(orig_pte);
> + }

After writing the comments above, I think I figured out that the whole
point of this loop is to check that the ptes in the contig range are
still consistent and the only variation allowed is the dirty/young
state to be passed to the orig_pte returned. The original pte may have
been updated by the time this loop finishes but I don't think it
matters, it wouldn't be any different than reading a single pte and
returning it while it is being updated.

If you can make this easier to parse (in a few years time) with an
additional patch adding some more comments, that would be great. For
this patch:

Reviewed-by: Catalin Marinas 

-- 
Catalin


Re: [PATCH v6 11/18] arm64/mm: Split __flush_tlb_range() to elide trailing DSB

2024-02-15 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:31:58AM +, Ryan Roberts wrote:
> Split __flush_tlb_range() into __flush_tlb_range_nosync() +
> __flush_tlb_range(), in the same way as the existing flush_tlb_page()
> arrangement. This allows calling __flush_tlb_range_nosync() to elide the
> trailing DSB. Forthcoming "contpte" code will take advantage of this
> when clearing the young bit from a contiguous range of ptes.
> 
> Ordering between dsb and mmu_notifier_arch_invalidate_secondary_tlbs()
> has changed, but now aligns with the ordering of __flush_tlb_page(). It
> has been discussed that __flush_tlb_page() may be wrong though.
> Regardless, both will be resolved separately if needed.
> 
> Reviewed-by: David Hildenbrand 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: [PATCH v6 10/18] arm64/mm: New ptep layer to manage contig bit

2024-02-15 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:31:57AM +, Ryan Roberts wrote:
> Create a new layer for the in-table PTE manipulation APIs. For now, the
> existing API is prefixed with double underscore to become the
> arch-private API and the public API is just a simple wrapper that calls
> the private API.
> 
> The public API implementation will subsequently be used to transparently
> manipulate the contiguous bit where appropriate. But since there are
> already some contig-aware users (e.g. hugetlb, kernel mapper), we must
> first ensure those users use the private API directly so that the future
> contig-bit manipulations in the public API do not interfere with those
> existing uses.
> 
> The following APIs are treated this way:
> 
>  - ptep_get
>  - set_pte
>  - set_ptes
>  - pte_clear
>  - ptep_get_and_clear
>  - ptep_test_and_clear_young
>  - ptep_clear_flush_young
>  - ptep_set_wrprotect
>  - ptep_set_access_flags
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: [PATCH v6 09/18] arm64/mm: Convert ptep_clear() to ptep_get_and_clear()

2024-02-15 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:31:56AM +, Ryan Roberts wrote:
> ptep_clear() is a generic wrapper around the arch-implemented
> ptep_get_and_clear(). We are about to convert ptep_get_and_clear() into
> a public version and private version (__ptep_get_and_clear()) to support
> the transparent contpte work. We won't have a private version of
> ptep_clear() so let's convert it to directly call ptep_get_and_clear().
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: [PATCH v6 08/18] arm64/mm: Convert set_pte_at() to set_ptes(..., 1)

2024-02-15 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:31:55AM +, Ryan Roberts wrote:
> Since set_ptes() was introduced, set_pte_at() has been implemented as a
> generic macro around set_ptes(..., 1). So this change should continue to
> generate the same code. However, making this change prepares us for the
> transparent contpte support. It means we can reroute set_ptes() to
> __set_ptes(). Since set_pte_at() is a generic macro, there will be no
> equivalent __set_pte_at() to reroute to.
> 
> Note that a couple of calls to set_pte_at() remain in the arch code.
> This is intentional, since those call sites are acting on behalf of
> core-mm and should continue to call into the public set_ptes() rather
> than the arch-private __set_ptes().
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: [PATCH v6 07/18] arm64/mm: Convert READ_ONCE(*ptep) to ptep_get(ptep)

2024-02-15 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:31:54AM +, Ryan Roberts wrote:
> There are a number of places in the arch code that read a pte by using
> the READ_ONCE() macro. Refactor these call sites to instead use the
> ptep_get() helper, which itself is a READ_ONCE(). Generated code should
> be the same.
> 
> This will benefit us when we shortly introduce the transparent contpte
> support. In this case, ptep_get() will become more complex so we now
> have all the code abstracted through it.
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: [PATCH v6 04/18] arm64/mm: Convert pte_next_pfn() to pte_advance_pfn()

2024-02-15 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:31:51AM +, Ryan Roberts wrote:
> Core-mm needs to be able to advance the pfn by an arbitrary amount, so
> override the new pte_advance_pfn() API to do so.
> 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: get_user_pages() and EXEC_ONLY mapping.

2023-11-10 Thread Catalin Marinas
On Fri, Nov 10, 2023 at 08:19:23PM +0530, Aneesh Kumar K.V wrote:
> Some architectures can now support EXEC_ONLY mappings and I am wondering
> what get_user_pages() on those addresses should return. Earlier
> PROT_EXEC implied PROT_READ and pte_access_permitted() returned true for
> that. But arm64 does have this explicit comment that says
> 
>  /*
>  * p??_access_permitted() is true for valid user mappings (PTE_USER
>  * bit set, subject to the write permission check). For execute-only
>  * mappings, like PROT_EXEC with EPAN (both PTE_USER and PTE_UXN bits
>  * not set) must return false. PROT_NONE mappings do not have the
>  * PTE_VALID bit set.
>  */
> 
> Is that correct? We should be able to get struct page for PROT_EXEC
> mappings?

I don't remember why we ended up with this. Briefly looking at the code,
pte_access_permitted() is only used on the fast GUP path. On the slow
path, there is a check_vma_flags() call which returns -EFAULT if the vma
is not readable. So the pte_access_permitted() on the fast path matches
the semantics of the slow path.

If one wants the page structure, FOLL_FORCE ignores the read check (on
the slow path), though I think it still fails if VM_MAYREAD is not set.
Unless you have a real use-case where this is not sufficient, I'd leave
the behaviour as is on arm64 (and maybe update other architectures that
support exec-only to do the same).
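
For reference, the slow-path check is roughly the following (paraphrased from
mm/gup.c's check_vma_flags(), so treat the exact details as approximate):

	if (!(vm_flags & VM_READ)) {
		if (!(gup_flags & FOLL_FORCE))
			return -EFAULT;
		/*
		 * Even with FOLL_FORCE, refuse if the mapping could never
		 * have been readable.
		 */
		if (!(vm_flags & VM_MAYREAD))
			return -EFAULT;
	}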

-- 
Catalin


Re: [PATCH v3 9/9] efi: move screen_info into efi init code

2023-10-10 Thread Catalin Marinas
On Mon, Oct 09, 2023 at 11:18:45PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> After the vga console no longer relies on global screen_info, there are
> only two remaining use cases:
> 
>  - on the x86 architecture, it is used for multiple boot methods
>(bzImage, EFI, Xen, kexec) to communicate the initial VGA or framebuffer
>settings to a number of device drivers.
> 
>  - on other architectures, it is only used as part of the EFI stub,
>and only for the three sysfb framebuffers (simpledrm, simplefb, efifb).
> 
> Remove the duplicate data structure definitions by moving it into the
> efi-init.c file that sets it up initially for the EFI case, leaving x86
> as an exception that retains its own definition for non-EFI boots.
> 
> The added #ifdefs here are optional; I added them to further limit the
> reach of screen_info to configurations that have at least one of the
> users enabled.
> 
> Reviewed-by: Ard Biesheuvel 
> Reviewed-by: Javier Martinez Canillas 
> Acked-by: Helge Deller 
> Signed-off-by: Arnd Bergmann 
> ---
>  arch/arm/kernel/setup.c   |  4 
>  arch/arm64/kernel/efi.c   |  4 
>  arch/arm64/kernel/image-vars.h|  2 ++

It's more Ard's thing and he reviewed it already but if you need another
ack:

Acked-by: Catalin Marinas 


Re: [PATCH v2] arch: Reserve map_shadow_stack() syscall number for all architectures

2023-10-04 Thread Catalin Marinas
On Thu, Sep 14, 2023 at 06:58:03PM +, Sohil Mehta wrote:
> commit c35559f94ebc ("x86/shstk: Introduce map_shadow_stack syscall")
> recently added support for map_shadow_stack() but it is limited to x86
> only for now. There is a possibility that other architectures (namely,
> arm64 and RISC-V), that are implementing equivalent support for shadow
> stacks, might need to add support for it.
> 
> Independent of that, reserving arch-specific syscall numbers in the
> syscall tables of all architectures is good practice and would help
> avoid future conflicts. map_shadow_stack() is marked as a conditional
> syscall in sys_ni.c. Adding it to the syscall tables of other
> architectures is harmless and would return ENOSYS when exercised.
> 
> Note, map_shadow_stack() was assigned #453 during the merge process
> since #452 was taken by fchmodat2().
> 
> For Powerpc, map it to sys_ni_syscall() as is the norm for Powerpc
> syscall tables.
> 
> For Alpha, map_shadow_stack() takes up #563 as Alpha still diverges from
> the common syscall numbering system in the other architectures.
> 
> Link: 
> https://lore.kernel.org/lkml/20230515212255.ga562...@debug.ba.rivosinc.com/
> Link: 
> https://lore.kernel.org/lkml/b402b80b-a7c6-4ef0-b977-c0f5f582b...@sirena.org.uk/
> 
> Signed-off-by: Sohil Mehta 
> ---
> v2:
> - Skip syscall table changes to tools/. They will be handled separately by the
>   perf folks.
> - Map Powerpc to sys_ni_syscall (Rick Edgecombe)
> ---
>  arch/alpha/kernel/syscalls/syscall.tbl  | 1 +
>  arch/arm/tools/syscall.tbl  | 1 +
>  arch/arm64/include/asm/unistd.h     | 2 +-
>  arch/arm64/include/asm/unistd32.h   | 2 ++

For arm64 (compat):

Acked-by: Catalin Marinas 


Re: [PATCH v1 0/8] Fix set_huge_pte_at() panic on arm64

2023-09-21 Thread Catalin Marinas
On Thu, Sep 21, 2023 at 05:35:54PM +0100, Ryan Roberts wrote:
> On 21/09/2023 17:30, Andrew Morton wrote:
> > On Thu, 21 Sep 2023 17:19:59 +0100 Ryan Roberts  
> > wrote:
> >> Ryan Roberts (8):
> >>   parisc: hugetlb: Convert set_huge_pte_at() to take vma
> >>   powerpc: hugetlb: Convert set_huge_pte_at() to take vma
> >>   riscv: hugetlb: Convert set_huge_pte_at() to take vma
> >>   s390: hugetlb: Convert set_huge_pte_at() to take vma
> >>   sparc: hugetlb: Convert set_huge_pte_at() to take vma
> >>   mm: hugetlb: Convert set_huge_pte_at() to take vma
> >>   arm64: hugetlb: Convert set_huge_pte_at() to take vma
> >>   arm64: hugetlb: Fix set_huge_pte_at() to work with all swap entries
> >>
> >>  arch/arm64/include/asm/hugetlb.h  |  2 +-
> >>  arch/arm64/mm/hugetlbpage.c   | 22 --
> >>  arch/parisc/include/asm/hugetlb.h |  2 +-
> >>  arch/parisc/mm/hugetlbpage.c  |  4 +--
> >>  .../include/asm/nohash/32/hugetlb-8xx.h   |  3 +-
> >>  arch/powerpc/mm/book3s64/hugetlbpage.c|  2 +-
> >>  arch/powerpc/mm/book3s64/radix_hugetlbpage.c  |  2 +-
> >>  arch/powerpc/mm/nohash/8xx.c  |  2 +-
> >>  arch/powerpc/mm/pgtable.c |  7 -
> >>  arch/riscv/include/asm/hugetlb.h  |  2 +-
> >>  arch/riscv/mm/hugetlbpage.c   |  3 +-
> >>  arch/s390/include/asm/hugetlb.h   |  8 +++--
> >>  arch/s390/mm/hugetlbpage.c|  8 -
> >>  arch/sparc/include/asm/hugetlb.h  |  8 +++--
> >>  arch/sparc/mm/hugetlbpage.c   |  8 -
> >>  include/asm-generic/hugetlb.h |  6 ++--
> >>  include/linux/hugetlb.h   |  6 ++--
> >>  mm/damon/vaddr.c  |  2 +-
> >>  mm/hugetlb.c  | 30 +--
> >>  mm/migrate.c  |  2 +-
> >>  mm/rmap.c | 10 +++
> >>  mm/vmalloc.c  |  5 +++-
> >>  22 files changed, 80 insertions(+), 64 deletions(-)
> > 
> > Looks scary but it's actually a fairly modest patchset.  It could
> > easily be all rolled into a single patch for ease of backporting. 
> > Maybe Greg has an opinion?
> 
> Yes, I thought about doing that; or perhaps 2 patches - one for the interface
> change across all arches and core code, and one for the actual bug fix?

I think this would make more sense, especially if we want to backport
it. The first patch would have no functional change, only an interface
change, followed by the arm64 fix.

-- 
Catalin


Re: [PATCH v3 5/5] mmu_notifiers: Rename invalidate_range notifier

2023-07-21 Thread Catalin Marinas
On Thu, Jul 20, 2023 at 06:39:27PM +1000, Alistair Popple wrote:
> diff --git a/arch/arm64/include/asm/tlbflush.h 
> b/arch/arm64/include/asm/tlbflush.h
> index a99349d..84a05a0 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -253,7 +253,7 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>   __tlbi(aside1is, asid);
>   __tlbi_user(aside1is, asid);
>   dsb(ish);
> - mmu_notifier_invalidate_range(mm, 0, -1UL);
> + mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
>  }
>  
>  static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> @@ -265,7 +265,7 @@ static inline void __flush_tlb_page_nosync(struct 
> mm_struct *mm,
>   addr = __TLBI_VADDR(uaddr, ASID(mm));
>   __tlbi(vale1is, addr);
>   __tlbi_user(vale1is, addr);
> - mmu_notifier_invalidate_range(mm, uaddr & PAGE_MASK,
> + mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
>   (uaddr & PAGE_MASK) + 
> PAGE_SIZE);
>  }
>  
> @@ -400,7 +400,7 @@ static inline void __flush_tlb_range(struct 
> vm_area_struct *vma,
>   scale++;
>   }
>   dsb(ish);
> - mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
>  }

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH v3 3/5] mmu_notifiers: Call invalidate_range() when invalidating TLBs

2023-07-21 Thread Catalin Marinas
On Thu, Jul 20, 2023 at 06:39:25PM +1000, Alistair Popple wrote:
> diff --git a/arch/arm64/include/asm/tlbflush.h 
> b/arch/arm64/include/asm/tlbflush.h
> index 3456866..a99349d 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -13,6 +13,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -252,6 +253,7 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>   __tlbi(aside1is, asid);
>   __tlbi_user(aside1is, asid);
>   dsb(ish);
> + mmu_notifier_invalidate_range(mm, 0, -1UL);
>  }
>  
>  static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> @@ -263,6 +265,8 @@ static inline void __flush_tlb_page_nosync(struct 
> mm_struct *mm,
>   addr = __TLBI_VADDR(uaddr, ASID(mm));
>   __tlbi(vale1is, addr);
>   __tlbi_user(vale1is, addr);
> + mmu_notifier_invalidate_range(mm, uaddr & PAGE_MASK,
> + (uaddr & PAGE_MASK) + 
> PAGE_SIZE);

Nitpick: we have PAGE_ALIGN() for this.
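
i.e. something like (one possible spelling; PAGE_ALIGN(uaddr + 1) rounds up to
the same next-page boundary as the open-coded expression):

	mmu_notifier_invalidate_range(mm, uaddr & PAGE_MASK,
				      PAGE_ALIGN(uaddr + 1));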

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH v11 4/4] arm64: support batched/deferred tlb shootdown during page reclamation/migration

2023-07-21 Thread Catalin Marinas
On Mon, Jul 17, 2023 at 09:10:04PM +0800, Yicong Yang wrote:
> +static inline void arch_tlbbatch_add_pending(struct 
> arch_tlbflush_unmap_batch *batch,
> +  struct mm_struct *mm,
> +  unsigned long uaddr)
> +{
> + __flush_tlb_page_nosync(mm, uaddr);
> +}
> +
> +static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
> +{
> + dsb(ish);
> +}
> +
> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch 
> *batch)
> +{
> + dsb(ish);
> +}

Nitpick: as an additional patch, I'd add a comment to these two functions
noting that the TLBIs have already been issued and only a DSB is needed
to synchronise their effect on the other CPUs.
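
Something like this (a sketch of the suggested comments, not a tested patch):

	static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
	{
		/*
		 * The TLBIs for the pending batched unmaps have already been
		 * issued; only a DSB is needed to make their effect visible
		 * to the other CPUs.
		 */
		dsb(ish);
	}

	static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
	{
		/* Same here: just wait for the previously issued TLBIs to complete. */
		dsb(ish);
	}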

Reviewed-by: Catalin Marinas 


Re: [PATCH v11 3/4] mm/tlbbatch: Introduce arch_flush_tlb_batched_pending()

2023-07-21 Thread Catalin Marinas
On Mon, Jul 17, 2023 at 09:10:03PM +0800, Yicong Yang wrote:
> From: Yicong Yang 
> 
> Currently we'll flush the mm in flush_tlb_batched_pending() to
> avoid a race between reclaim unmapping pages via batched TLB flush
> and mprotect/munmap/etc. Other architectures like arm64 may
> only need a synchronization barrier (dsb) here rather than
> a full mm flush. So add arch_flush_tlb_batched_pending() to
> allow an arch-specific implementation. This intends no
> functional change on x86, which still does a full mm flush.
> 
> Signed-off-by: Yicong Yang 

Reviewed-by: Catalin Marinas 


Re: [PATCH v11 2/4] mm/tlbbatch: Rename and extend some functions

2023-07-21 Thread Catalin Marinas
On Mon, Jul 17, 2023 at 09:10:02PM +0800, Yicong Yang wrote:
> From: Barry Song 
> 
> This patch does some preparatory work to extend batched TLB flush to
> arm64, including:
> - Extend set_tlb_ubc_flush_pending() and arch_tlbbatch_add_mm()
>   to accept an additional address argument; architectures
>   like arm64 may need this for tlbi.
> - Rename arch_tlbbatch_add_mm() to arch_tlbbatch_add_pending()
>   to match its current function: we don't need to handle the
>   mm on architectures like arm64, so add_mm is no longer accurate,
>   while add_pending makes sense for both, as on x86 we're pending the
>   TLB flush operations while on arm64 we're pending the synchronisation
>   operations.
> 
> This intends no functional change on x86.
> 
> Cc: Anshuman Khandual 
> Cc: Jonathan Corbet 
> Cc: Nadav Amit 
> Cc: Mel Gorman 
> Tested-by: Yicong Yang 
> Tested-by: Xin Hao 
> Tested-by: Punit Agrawal 
> Signed-off-by: Barry Song 
> Signed-off-by: Yicong Yang 
> Reviewed-by: Kefeng Wang 
> Reviewed-by: Xin Hao 
> Reviewed-by: Anshuman Khandual 

Reviewed-by: Catalin Marinas 


Re: [PATCH v11 1/4] mm/tlbbatch: Introduce arch_tlbbatch_should_defer()

2023-07-21 Thread Catalin Marinas
On Mon, Jul 17, 2023 at 09:10:01PM +0800, Yicong Yang wrote:
> From: Anshuman Khandual 
> 
> The entire scheme of deferred TLB flush in the reclaim path rests on the
> fact that the cost to refill TLB entries is less than that of flushing out
> individual entries by sending IPIs to remote CPUs. But architectures
> can have different ways to evaluate that. Hence, apart from checking
> TTU_BATCH_FLUSH in the TTU flags, the rest of the decision should be
> architecture-specific.
> 
> Signed-off-by: Anshuman Khandual 
> [https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khand...@linux.vnet.ibm.com/]
> Signed-off-by: Yicong Yang 
> [Rebase and fix incorrect return value type]
> Reviewed-by: Kefeng Wang 
> Reviewed-by: Anshuman Khandual 
> Reviewed-by: Barry Song 
> Reviewed-by: Xin Hao 
> Tested-by: Punit Agrawal 

Reviewed-by: Catalin Marinas 


Re: [PATCH v10 4/4] arm64: support batched/deferred tlb shootdown during page reclamation/migration

2023-07-16 Thread Catalin Marinas
On Mon, Jul 10, 2023 at 04:39:14PM +0800, Yicong Yang wrote:
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 7856c3a3e35a..f0ce8208c57f 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -96,6 +96,7 @@ config ARM64
>   select ARCH_SUPPORTS_NUMA_BALANCING
>   select ARCH_SUPPORTS_PAGE_TABLE_CHECK
>   select ARCH_SUPPORTS_PER_VMA_LOCK
> + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if EXPERT

I don't want EXPERT to turn on a feature that's not selectable by the
user. This would lead to different performance behaviour based on
EXPERT. Just select it unconditionally.

> diff --git a/arch/arm64/include/asm/tlbflush.h 
> b/arch/arm64/include/asm/tlbflush.h
> index 412a3b9a3c25..4bb9cec62e26 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -254,17 +254,23 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>   dsb(ish);
>  }
>  
> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> -  unsigned long uaddr)
> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> +unsigned long uaddr)
>  {
>   unsigned long addr;
>  
>   dsb(ishst);
> - addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
> + addr = __TLBI_VADDR(uaddr, ASID(mm));
>   __tlbi(vale1is, addr);
>   __tlbi_user(vale1is, addr);
>  }
>  
> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> +  unsigned long uaddr)
> +{
> + return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
> +}
> +
>  static inline void flush_tlb_page(struct vm_area_struct *vma,
> unsigned long uaddr)
>  {
> @@ -272,6 +278,42 @@ static inline void flush_tlb_page(struct vm_area_struct 
> *vma,
>   dsb(ish);
>  }
>  
> +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH

If it's selected unconditionally, we won't need this #ifdef here.

> +
> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> +{
> +#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
> + /*
> +  * TLB flush deferral is not required on systems, which are affected 
> with

"affected by" and drop the comma before "which".

-- 
Catalin


Re: [RESEND PATCH v9 2/2] arm64: support batched/deferred tlb shootdown during page reclamation/migration

2023-06-29 Thread Catalin Marinas
On Thu, Jun 29, 2023 at 05:31:36PM +0100, Catalin Marinas wrote:
> On Thu, May 18, 2023 at 02:59:34PM +0800, Yicong Yang wrote:
> > From: Barry Song 
> > 
> > On x86, batched and deferred TLB shootdown has led to a 90%
> > performance increase in TLB shootdown. On arm64, HW can do
> > TLB shootdown without a software IPI, but a sync tlbi is still
> > quite expensive.
> [...]
> >  .../features/vm/TLB/arch-support.txt  |  2 +-
> >  arch/arm64/Kconfig|  1 +
> >  arch/arm64/include/asm/tlbbatch.h | 12 
> >  arch/arm64/include/asm/tlbflush.h | 33 -
> >  arch/arm64/mm/flush.c | 69 +++
> >  arch/x86/include/asm/tlbflush.h   |  5 +-
> >  include/linux/mm_types_task.h |  4 +-
> >  mm/rmap.c | 12 ++--
> 
> First of all, this patch needs to be split into some preparatory patches
> introducing/renaming functions with no functional change for x86. Once
> done, you can add the arm64-only changes.
> 
> Now, on the implementation, I had some comments on v7 but we didn't get
> to a conclusion and the thread eventually died:
> 
> https://lore.kernel.org/linux-mm/y7ctoj5mwd1zb...@arm.com/
> 
> I know I said a command line argument is better than Kconfig or some
> random number of CPUs heuristics but it would be even better if we don't
> bother with any, just make this always on. Barry had some comments
> around mprotect() being racy and that's why we have
> flush_tlb_batched_pending() but I don't think it's needed (or, for
> arm64, it can be a DSB since this patch issues the TLBIs but without the
> DVM Sync). So we need to clarify this (see Barry's last email on the
> above thread) before attempting new versions of this patchset. With
> flush_tlb_batched_pending() removed (or DSB), I have a suspicion such
> implementation would be faster on any SoC irrespective of the number of
> CPUs.

I think I got the need for flush_tlb_batched_pending(). If
try_to_unmap() marks the pte !present and we have a pending TLBI,
change_pte_range() will skip the TLB maintenance altogether since it did
not change the pte. So we could be left with stale TLB entries after
mprotect() before TTU does the batch flushing.

We can have an arch-specific flush_tlb_batched_pending() that can be a
DSB only on arm64 and a full mm flush on x86.
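
To illustrate the window (a rough timeline, not taken from the kernel
sources):

	/*
	 * CPU A (reclaim, TTU batched)      CPU B (mprotect)
	 * ----------------------------      ----------------------------------
	 * clear pte, TLBI pending
	 * (not yet synchronised)
	 *                                   change_pte_range(): pte is already
	 *                                   !present, nothing changed, so no
	 *                                   TLB maintenance is issued
	 *                                   return to user -> may still hit a
	 *                                   stale TLB entry
	 * ...batched flush happens later
	 *
	 * flush_tlb_batched_pending() on CPU B closes this window; on arm64 a
	 * DSB is enough because the TLBIs have already been issued.
	 */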

-- 
Catalin


Re: [RESEND PATCH v9 2/2] arm64: support batched/deferred tlb shootdown during page reclamation/migration

2023-06-29 Thread Catalin Marinas
On Thu, May 18, 2023 at 02:59:34PM +0800, Yicong Yang wrote:
> From: Barry Song 
> 
> On x86, batched and deferred TLB shootdown has led to a 90%
> performance increase in TLB shootdown. On arm64, HW can do
> TLB shootdown without a software IPI, but a sync tlbi is still
> quite expensive.
[...]
>  .../features/vm/TLB/arch-support.txt  |  2 +-
>  arch/arm64/Kconfig|  1 +
>  arch/arm64/include/asm/tlbbatch.h | 12 
>  arch/arm64/include/asm/tlbflush.h | 33 -
>  arch/arm64/mm/flush.c | 69 +++
>  arch/x86/include/asm/tlbflush.h   |  5 +-
>  include/linux/mm_types_task.h |  4 +-
>  mm/rmap.c | 12 ++--

First of all, this patch needs to be split into some preparatory patches
introducing/renaming functions with no functional change for x86. Once
done, you can add the arm64-only changes.

Now, on the implementation, I had some comments on v7 but we didn't get
to a conclusion and the thread eventually died:

https://lore.kernel.org/linux-mm/y7ctoj5mwd1zb...@arm.com/

I know I said a command line argument is better than Kconfig or some
random number of CPUs heuristics but it would be even better if we don't
bother with any, just make this always on. Barry had some comments
around mprotect() being racy and that's why we have
flush_tlb_batched_pending() but I don't think it's needed (or, for
arm64, it can be a DSB since this patch issues the TLBIs but without the
DVM Sync). So we need to clarify this (see Barry's last email on the
above thread) before attempting new versions of this patchset. With
flush_tlb_batched_pending() removed (or DSB), I have a suspicion such
implementation would be faster on any SoC irrespective of the number of
CPUs.

-- 
Catalin


Re: [PATCH v4 21/34] arm64: Convert various functions to use ptdescs

2023-06-14 Thread Catalin Marinas
On Mon, Jun 12, 2023 at 02:04:10PM -0700, Vishal Moola (Oracle) wrote:
> As part of the conversions to replace pgtable constructor/destructors with
> ptdesc equivalents, convert various page table functions to use ptdescs.
> 
> Signed-off-by: Vishal Moola (Oracle) 

Acked-by: Catalin Marinas 


Re: [PATCH 0/3] Move the ARCH_DMA_MINALIGN definition to asm/cache.h

2023-06-13 Thread Catalin Marinas
On Tue, Jun 13, 2023 at 04:42:40PM +, Christophe Leroy wrote:
> 
> 
> On 13/06/2023 17:52, Catalin Marinas wrote:
> > Hi,
> > 
> > The ARCH_KMALLOC_MINALIGN reduction series defines a generic
> > ARCH_DMA_MINALIGN in linux/cache.h:
> > 
> > https://lore.kernel.org/r/20230612153201.554742-2-catalin.mari...@arm.com/
> > 
> > Unfortunately, this causes a duplicate definition warning for
> > microblaze, powerpc (32-bit only) and sh as these architectures define
> > ARCH_DMA_MINALIGN in a different file than asm/cache.h. Move the macro
> > to asm/cache.h to avoid this issue and also bring them in line with the
> > other architectures.
> 
> What about mips ?
> 
> arch/mips/include/asm/mach-generic/kmalloc.h:#define ARCH_DMA_MINALIGN 128
> arch/mips/include/asm/mach-ip32/kmalloc.h:#define ARCH_DMA_MINALIGN   32
> arch/mips/include/asm/mach-ip32/kmalloc.h:#define ARCH_DMA_MINALIGN   128
> arch/mips/include/asm/mach-n64/kmalloc.h:#define ARCH_DMA_MINALIGN L1_CACHE_BYTES
> arch/mips/include/asm/mach-tx49xx/kmalloc.h:#define ARCH_DMA_MINALIGN L1_CACHE_BYTES

Sorry, I should have mentioned it in the cover letter (discussed here -
https://lore.kernel.org/r/zihpaixb%2f0ve7...@arm.com/). These kmalloc.h
files are included in asm/cache.h, based on which machine is enabled, so
there's no problem for mips. It makes more sense to keep them in those
mach-*/kmalloc.h files instead of having lots of #ifdefs in cache.h.

-- 
Catalin


[PATCH 2/3] microblaze: Move the ARCH_{DMA,SLAB}_MINALIGN definitions to asm/cache.h

2023-06-13 Thread Catalin Marinas
The microblaze architecture defines ARCH_DMA_MINALIGN in asm/page.h.
Move it to asm/cache.h to allow a generic ARCH_DMA_MINALIGN definition
in linux/cache.h without redefinition errors/warnings.

While at it, also move ARCH_SLAB_MINALIGN to asm/cache.h for
consistency.

Signed-off-by: Catalin Marinas 
Cc: Michal Simek 
---
 arch/microblaze/include/asm/cache.h | 5 +
 arch/microblaze/include/asm/page.h  | 5 -
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/microblaze/include/asm/cache.h 
b/arch/microblaze/include/asm/cache.h
index a149b3e711ec..1903988b9e23 100644
--- a/arch/microblaze/include/asm/cache.h
+++ b/arch/microblaze/include/asm/cache.h
@@ -18,4 +18,9 @@
 
 #define SMP_CACHE_BYTESL1_CACHE_BYTES
 
+/* MS be sure that SLAB allocates aligned objects */
+#define ARCH_DMA_MINALIGN  L1_CACHE_BYTES
+
+#define ARCH_SLAB_MINALIGN L1_CACHE_BYTES
+
 #endif /* _ASM_MICROBLAZE_CACHE_H */
diff --git a/arch/microblaze/include/asm/page.h 
b/arch/microblaze/include/asm/page.h
index 7b9861bcd458..337f23eabc71 100644
--- a/arch/microblaze/include/asm/page.h
+++ b/arch/microblaze/include/asm/page.h
@@ -30,11 +30,6 @@
 
 #ifndef __ASSEMBLY__
 
-/* MS be sure that SLAB allocates aligned objects */
-#define ARCH_DMA_MINALIGN  L1_CACHE_BYTES
-
-#define ARCH_SLAB_MINALIGN L1_CACHE_BYTES
-
 /*
  * PAGE_OFFSET -- the first address of the first page of memory. With MMU
  * it is set to the kernel start address (aligned on a page boundary).


[PATCH 3/3] sh: Move the ARCH_DMA_MINALIGN definition to asm/cache.h

2023-06-13 Thread Catalin Marinas
The sh architecture defines ARCH_DMA_MINALIGN in asm/page.h. Move it to
asm/cache.h to allow a generic ARCH_DMA_MINALIGN definition in
linux/cache.h without redefinition errors/warnings.

Signed-off-by: Catalin Marinas 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: John Paul Adrian Glaubitz 
Cc: linux...@vger.kernel.org
---
 arch/sh/include/asm/cache.h | 6 ++
 arch/sh/include/asm/page.h  | 6 --
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/sh/include/asm/cache.h b/arch/sh/include/asm/cache.h
index 32dfa6b82ec6..b38dbc975581 100644
--- a/arch/sh/include/asm/cache.h
+++ b/arch/sh/include/asm/cache.h
@@ -14,6 +14,12 @@
 
 #define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
 
+/*
+ * Some drivers need to perform DMA into kmalloc'ed buffers
+ * and so we have to increase the kmalloc minalign for this.
+ */
+#define ARCH_DMA_MINALIGN  L1_CACHE_BYTES
+
 #define __read_mostly __section(".data..read_mostly")
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sh/include/asm/page.h b/arch/sh/include/asm/page.h
index 09ac6c7faee0..62f4b9edcb98 100644
--- a/arch/sh/include/asm/page.h
+++ b/arch/sh/include/asm/page.h
@@ -174,10 +174,4 @@ typedef struct page *pgtable_t;
 #include 
 #include 
 
-/*
- * Some drivers need to perform DMA into kmalloc'ed buffers
- * and so we have to increase the kmalloc minalign for this.
- */
-#define ARCH_DMA_MINALIGN  L1_CACHE_BYTES
-
 #endif /* __ASM_SH_PAGE_H */


[PATCH 1/3] powerpc: Move the ARCH_DMA_MINALIGN definition to asm/cache.h

2023-06-13 Thread Catalin Marinas
The powerpc architecture defines ARCH_DMA_MINALIGN in asm/page_32.h and
only if CONFIG_NOT_COHERENT_CACHE is enabled (32-bit platforms only).
Move this macro to asm/cache.h to allow a generic ARCH_DMA_MINALIGN
definition in linux/cache.h without redefinition errors/warnings.

Signed-off-by: Catalin Marinas 
Reported-by: kernel test robot 
Closes: 
https://lore.kernel.org/oe-kbuild-all/202306131053.1ybvrrho-...@intel.com/
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/cache.h   | 4 
 arch/powerpc/include/asm/page_32.h | 4 
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/cache.h b/arch/powerpc/include/asm/cache.h
index ae0a68a838e8..69232231d270 100644
--- a/arch/powerpc/include/asm/cache.h
+++ b/arch/powerpc/include/asm/cache.h
@@ -33,6 +33,10 @@
 
 #define IFETCH_ALIGN_BYTES (1 << IFETCH_ALIGN_SHIFT)
 
+#ifdef CONFIG_NOT_COHERENT_CACHE
+#define ARCH_DMA_MINALIGN  L1_CACHE_BYTES
+#endif
+
 #if !defined(__ASSEMBLY__)
 #ifdef CONFIG_PPC64
 
diff --git a/arch/powerpc/include/asm/page_32.h 
b/arch/powerpc/include/asm/page_32.h
index 56f217606327..b9ac9e3a771c 100644
--- a/arch/powerpc/include/asm/page_32.h
+++ b/arch/powerpc/include/asm/page_32.h
@@ -12,10 +12,6 @@
 
 #define VM_DATA_DEFAULT_FLAGS  VM_DATA_DEFAULT_FLAGS32
 
-#ifdef CONFIG_NOT_COHERENT_CACHE
-#define ARCH_DMA_MINALIGN  L1_CACHE_BYTES
-#endif
-
 #if defined(CONFIG_PPC_256K_PAGES) || \
 (defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES))
 #define PTE_SHIFT  (PAGE_SHIFT - PTE_T_LOG2 - 2)   /* 1/4 of a page */


[PATCH 0/3] Move the ARCH_DMA_MINALIGN definition to asm/cache.h

2023-06-13 Thread Catalin Marinas
Hi,

The ARCH_KMALLOC_MINALIGN reduction series defines a generic
ARCH_DMA_MINALIGN in linux/cache.h:

https://lore.kernel.org/r/20230612153201.554742-2-catalin.mari...@arm.com/

Unfortunately, this causes a duplicate definition warning for
microblaze, powerpc (32-bit only) and sh as these architectures define
ARCH_DMA_MINALIGN in a different file than asm/cache.h. Move the macro
to asm/cache.h to avoid this issue and also bring them in line with the
other architectures.

Andrew, if the arch maintainers cc'ed are fine with such a change, could
you please take these three patches together with the
ARCH_KMALLOC_MINALIGN series?

Thank you.

Catalin Marinas (3):
  powerpc: Move the ARCH_DMA_MINALIGN definition to asm/cache.h
  microblaze: Move the ARCH_{DMA,SLAB}_MINALIGN definitions to
asm/cache.h
  sh: Move the ARCH_DMA_MINALIGN definition to asm/cache.h

 arch/microblaze/include/asm/cache.h | 5 +
 arch/microblaze/include/asm/page.h  | 5 -
 arch/powerpc/include/asm/cache.h| 4 
 arch/powerpc/include/asm/page_32.h  | 4 
 arch/sh/include/asm/cache.h | 6 ++
 arch/sh/include/asm/page.h  | 6 --
 6 files changed, 15 insertions(+), 15 deletions(-)



Re: linux-next: build failure after merge of the mm tree

2023-06-13 Thread Catalin Marinas
Hi Stephen,

On Tue, Jun 13, 2023 at 04:21:19PM +1000, Stephen Rothwell wrote:
> After merging the mm tree, today's linux-next build (powerpc
> ppc44x_defconfig) failed like this:
> 
> In file included from arch/powerpc/include/asm/page.h:247,
>  from arch/powerpc/include/asm/thread_info.h:13,
>  from include/linux/thread_info.h:60,
>  from include/asm-generic/preempt.h:5,
>  from ./arch/powerpc/include/generated/asm/preempt.h:1,
>  from include/linux/preempt.h:78,
>  from include/linux/spinlock.h:56,
>  from include/linux/ipc.h:5,
>  from include/uapi/linux/sem.h:5,
>  from include/linux/sem.h:5,
>  from include/linux/compat.h:14,
>  from arch/powerpc/kernel/asm-offsets.c:12:
> arch/powerpc/include/asm/page_32.h:16: warning: "ARCH_DMA_MINALIGN" redefined
>16 | #define ARCH_DMA_MINALIGN   L1_CACHE_BYTES
>   | 
> In file included from include/linux/time.h:5,
>  from include/linux/compat.h:10:
> include/linux/cache.h:104: note: this is the location of the previous 
> definition
>   104 | #define ARCH_DMA_MINALIGN __alignof__(unsigned long long)
>   | 
> 
> (lots of these)
> 
> Caused by commit
> 
>   cc7335787e73 ("mm/slab: decouple ARCH_KMALLOC_MINALIGN from 
> ARCH_DMA_MINALIGN")
> 
> I have applied the following hack for today - we need something better.

I just posted this series fixing it for powerpc, microblaze and sh. I
did not add the #ifndef __powerpc64__ line since
CONFIG_NOT_COHERENT_CACHE should not be enabled for those builds.

https://lore.kernel.org/r/20230613155245.1228274-1-catalin.mari...@arm.com

> From: Stephen Rothwell 
> Date: Tue, 13 Jun 2023 16:07:16 +1000
> Subject: [PATCH] fix up for "mm/slab: decouple ARCH_KMALLOC_MINALIGN from 
> ARCH_DMA_MINALIGN"
> 
> Signed-off-by: Stephen Rothwell 
> ---
>  arch/powerpc/include/asm/cache.h | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/cache.h 
> b/arch/powerpc/include/asm/cache.h
> index ae0a68a838e8..e9be1396dfd1 100644
> --- a/arch/powerpc/include/asm/cache.h
> +++ b/arch/powerpc/include/asm/cache.h
> @@ -142,5 +142,14 @@ static inline void iccci(void *addr)
>  }
>  
>  #endif /* !__ASSEMBLY__ */
> +
> +#ifndef __powerpc64__
> +#ifdef CONFIG_NOT_COHERENT_CACHE
> +#ifndef ARCH_DMA_MINALIGN
> +#define ARCH_DMA_MINALIGNL1_CACHE_BYTES
> +#endif
> +#endif
> +#endif
> +
>  #endif /* __KERNEL__ */
>  #endif /* _ASM_POWERPC_CACHE_H */

I think it should also remove the ARCH_DMA_MINALIGN from asm/page.h (as
I did in my series; sorry I did not cc you, only noticed now that you
reported it as well).

-- 
Catalin


Re: [PATCH] irq_work: consolidate arch_irq_work_raise prototypes

2023-05-25 Thread Catalin Marinas
On Tue, May 16, 2023 at 10:02:31PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> The prototype was hidden on x86, which causes a warning:
> 
> kernel/irq_work.c:72:13: error: no previous prototype for 
> 'arch_irq_work_raise' [-Werror=missing-prototypes]
> 
> Fix this by providing it in only one place that is always visible.
> 
> Signed-off-by: Arnd Bergmann 

Acked-by: Catalin Marinas 


Re: [PATCH 03/23] arm64/hugetlb: pte_alloc_huge() pte_offset_huge()

2023-05-25 Thread Catalin Marinas
On Tue, May 09, 2023 at 09:45:57PM -0700, Hugh Dickins wrote:
> pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
> that: to keep balance in future, use the recently added pte_alloc_huge()
> instead; with pte_offset_huge() a better name for pte_offset_kernel().
> 
> Signed-off-by: Hugh Dickins 

Acked-by: Catalin Marinas 


Re: [PATCH 02/23] arm64: allow pte_offset_map() to fail

2023-05-25 Thread Catalin Marinas
On Tue, May 09, 2023 at 09:43:47PM -0700, Hugh Dickins wrote:
> In rare transient cases, not yet made possible, pte_offset_map() and
> pte_offset_map_lock() may not find a page table: handle appropriately.
> 
> Signed-off-by: Hugh Dickins 

Acked-by: Catalin Marinas 


Re: [PATCH 13/14] thread_info: move function declarations to linux/thread_info.h

2023-05-24 Thread Catalin Marinas
On Wed, May 17, 2023 at 03:11:01PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> There are a few __weak functions in kernel/fork.c, which architectures
> can override. If there is no prototype, the compiler warns about them:
> 
> kernel/fork.c:164:13: error: no previous prototype for 
> 'arch_release_task_struct' [-Werror=missing-prototypes]
> kernel/fork.c:991:20: error: no previous prototype for 'arch_task_cache_init' 
> [-Werror=missing-prototypes]
> kernel/fork.c:1086:12: error: no previous prototype for 
> 'arch_dup_task_struct' [-Werror=missing-prototypes]
> 
> There are already prototypes in a number of architecture specific headers
> that have addressed those warnings before, but it's much better to have
> these in a single place so the warning no longer shows up anywhere.
> 
> Signed-off-by: Arnd Bergmann 

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH v3 02/14] arm64: drop ranges in definition of ARCH_FORCE_MAX_ORDER

2023-04-27 Thread Catalin Marinas
On Tue, Apr 25, 2023 at 11:09:58AM -0500, Justin Forbes wrote:
> On Tue, Apr 18, 2023 at 5:22 PM Andrew Morton  
> wrote:
> > On Wed, 12 Apr 2023 18:27:08 +0100 Catalin Marinas 
> >  wrote:
> > > > It sounds nice in theory. In practice, EXPERT hides too much. When you
> > > > flip expert, you expose over a 175ish new config options which are
> > > > hidden behind EXPERT.  You don't have to know what you are doing just
> > > > with the MAX_ORDER, but a whole bunch more as well.  If everyone were
> > > > already running 10, this might be less of a problem. At least Fedora
> > > > and RHEL are running 13 for 4K pages on aarch64. This was not some
> > > > accidental choice, we had to carry a patch to even allow it for a
> > > > while.  If this does go in as is, we will likely just carry a patch to
> > > > remove the "if EXPERT", but that is a bit of a disservice to users who
> > > > might be trying to debug something else upstream, bisecting upstream
> > > > kernels or testing a patch.  In those cases, people tend to use
> > > > pristine upstream sources without distro patches to verify, and they
> > > > tend to use their existing configs. With this change, their MAX_ORDER
> > > > will drop to 10 from 13 silently. That can look enough like a different
> > > > issue to ruin a bisect or have them give bad feedback on a
> > > > patch because it introduces a "regression" which is not a regression
> > > > at all, but a config change they couldn't see.
> > >
> > > If we remove EXPERT (as prior to this patch), I'd rather keep the ranges
> > > and avoid having to explain to people why some random MAX_ORDER doesn't
> > > build (keeping the range would also make sense for randconfig, not sure
> > > we got to any conclusion there).
> >
> > Well this doesn't seem to have got anywhere.  I think I'll send the
> > patchset into Linus for the next merge window as-is.  Please let's take
> > a look at this Kconfig presentation issue during the following -rc
> > cycle.
> 
> Well, I am very sorry to see this going in as is.  It will silently
> change people building with oldconfig, and anyone not paying attention
> will not notice until an issue is hit where "it worked before, and my
> config hasn't changed".  If EXPERT is unset, there is no notification,
> just a changed behavior.  While it would be easy for me to carry a
> patch dropping the if EXPERT, it will not help any users building on
> upstream with our configs, whether for their own regular use, or while
> trying to debug other issues,  I expect it will result in a reasonable
> amount of frustration from users trying to do the right thing and
> bisect or test patches upstream.

As I said in a previous reply, I'm fine with reverting this commit if it
breaks existing configs. It's only that Andrew had already queued it in
his tree but we have time until the final 6.4 kernel is released.

That said, would you mind sending a patch reverting it (if removing
EXPERT, I'd like to keep the ranges)? ;)

Thanks.

-- 
Catalin


Re: [PATCH v3 02/14] arm64: drop ranges in definition of ARCH_FORCE_MAX_ORDER

2023-04-19 Thread Catalin Marinas
On Tue, Apr 18, 2023 at 03:05:57PM -0700, Andrew Morton wrote:
> On Wed, 12 Apr 2023 18:27:08 +0100 Catalin Marinas  
> wrote:
> > > It sounds nice in theory. In practice. EXPERT hides too much. When you
> > > flip expert, you expose over a 175ish new config options which are
> > > hidden behind EXPERT.  You don't have to know what you are doing just
> > > with the MAX_ORDER, but a whole bunch more as well.  If everyone were
> > > already running 10, this might be less of a problem. At least Fedora
> > > and RHEL are running 13 for 4K pages on aarch64. This was not some
> > > accidental choice, we had to carry a patch to even allow it for a
> > > while.  If this does go in as is, we will likely just carry a patch to
> > > remove the "if EXPERT", but that is a bit of a disservice to users who
> > > might be trying to debug something else upstream, bisecting upstream
> > > kernels or testing a patch.  In those cases, people tend to use
> > > pristine upstream sources without distro patches to verify, and they
> > > tend to use their existing configs. With this change, their MAX_ORDER
> > > will drop to 10 from 13 silently.   That can look like a different
> > > issue enough to ruin a bisect or have them give bad feedback on a
> > > patch because it introduces a "regression" which is not a regression
> > > at all, but a config change they couldn't see.
> > 
> > If we remove EXPERT (as prior to this patch), I'd rather keep the ranges
> > and avoid having to explain to people why some random MAX_ORDER doesn't
> > build (keeping the range would also make sense for randconfig, not sure
> > we got to any conclusion there).
> 
> Well this doesn't seem to have got anywhere.  I think I'll send the
> patchset into Linus for the next merge window as-is.  Please let's take
> a look at this Kconfig presentation issue during the following -rc
> cycle.

That's fine by me. I have a slight preference to drop EXPERT and keep
the ranges in, especially if it affects current distro kernels. Debian
seems to enable EXPERT already in their arm64 kernel config but I'm not
sure about the Fedora or other distro kernels. If they don't, we can
fix/revert this Kconfig entry once the merging window is closed.

-- 
Catalin


Re: [PATCH v3 02/14] arm64: drop ranges in definition of ARCH_FORCE_MAX_ORDER

2023-04-12 Thread Catalin Marinas
On Tue, Apr 04, 2023 at 06:50:01AM -0500, Justin Forbes wrote:
> On Tue, Apr 4, 2023 at 2:22 AM Mike Rapoport  wrote:
> > On Wed, Mar 29, 2023 at 10:55:37AM -0500, Justin Forbes wrote:
> > > On Sat, Mar 25, 2023 at 1:09 AM Mike Rapoport  wrote:
> > > >
> > > > From: "Mike Rapoport (IBM)" 
> > > >
> > > > It is not a good idea to change fundamental parameters of core memory
> > > > management. Having predefined ranges suggests that the values within
> > > > those ranges are sensible, but one has to *really* understand
> > > > implications of changing MAX_ORDER before actually amending it and
> > > > ranges don't help here.
> > > >
> > > > Drop ranges in definition of ARCH_FORCE_MAX_ORDER and make its prompt
> > > > visible only if EXPERT=y
> > >
> > > I do not like suddenly hiding this behind EXPERT for a couple of
> > > reasons.  Most importantly, it will silently change the config for
> > > users building with an old kernel config.  If a user has for instance
> > > "13" set and building with 4K pages, as is the current configuration
> > > for Fedora and RHEL aarch64 builds, an oldconfig build will now set it
> > > to 10 with no indication that it is doing so.  And while I think that
> > > 10 is a fine default for many aarch64 users, there are valid reasons
> > > for choosing other values. Putting this behind expert makes it much
> > > less obvious that this is an option.
> >
> > That's the idea of EXPERT, no?
> >
> > This option was intended to allow allocation of huge pages for
> > architectures that had PMD_ORDER > MAX_ORDER and not to allow user to
> > select size of maximal physically contiguous allocation.
> >
> > Changes to MAX_ORDER fundamentally change the behaviour of core mm and
> > unless users *really* know what they are doing there is no reason to choose
> > non-default values so hiding this option behind EXPERT seems totally
> > appropriate to me.
> 
> It sounds nice in theory. In practice. EXPERT hides too much. When you
> flip expert, you expose over a 175ish new config options which are
> hidden behind EXPERT.  You don't have to know what you are doing just
> with the MAX_ORDER, but a whole bunch more as well.  If everyone were
> already running 10, this might be less of a problem. At least Fedora
> and RHEL are running 13 for 4K pages on aarch64. This was not some
> accidental choice, we had to carry a patch to even allow it for a
> while.  If this does go in as is, we will likely just carry a patch to
> remove the "if EXPERT", but that is a bit of a disservice to users who
> might be trying to debug something else upstream, bisecting upstream
> kernels or testing a patch.  In those cases, people tend to use
> pristine upstream sources without distro patches to verify, and they
> tend to use their existing configs. With this change, their MAX_ORDER
> will drop to 10 from 13 silently.   That can look like a different
> issue enough to ruin a bisect or have them give bad feedback on a
> patch because it introduces a "regression" which is not a regression
> at all, but a config change they couldn't see.

If we remove EXPERT (as prior to this patch), I'd rather keep the ranges
and avoid having to explain to people why some random MAX_ORDER doesn't
build (keeping the range would also make sense for randconfig, not sure
we got to any conclusion there).

-- 
Catalin


Re: [PATCH 18/21] ARM: drop SMP support for ARM11MPCore

2023-03-31 Thread Catalin Marinas
On Mon, Mar 27, 2023 at 02:13:14PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> The cache management operations for noncoherent DMA on ARMv6 work
> in two different ways:
> 
>  * When CONFIG_DMA_CACHE_RWFO is set, speculative prefetches on in-flight
>DMA buffers lead to data corruption when the prefetched data is written
>back on top of data from the device.
> 
>  * When CONFIG_DMA_CACHE_RWFO is disabled, a cache flush on one CPU
>is not seen by the other core(s), leading to inconsistent contents
>across the system.
> 
> As a consequence, neither configuration is actually safe to use in a
> general-purpose kernel that is used on both MPCore systems and ARM1176
> with prefetching enabled.

As the author of this terrible hack (created under duress ;))

Acked-by: Catalin Marinas 

IIRC, RWFO is working in combination with the cache operations. Because
the cache maintenance broadcast did not happen, we forced the cache
lines to migrate to a CPU via a write (for ownership) and doing the
cache maintenance on that CPU (that was the FROM_DEVICE case). For the
TO_DEVICE case, reading on a CPU would cause dirty lines on another CPU
to be evicted (or migrated as dirty to the current CPU IIRC) then the
cache maintenance to clean them to PoC on the local CPU.
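
For anyone not familiar with the hack, a rough C-level sketch of the
FROM_DEVICE (invalidate) side is below. This is illustrative only: the
real implementation is ARMv6 assembly (v6_dma_inv_range in
arch/arm/mm/cache-v6.S) and CACHE_LINE_SIZE here just stands in for the
real D_CACHE_LINE_SIZE:

/* sketch only, not the real code */
#define CACHE_LINE_SIZE	32

static void rwfo_dma_inv_range(unsigned long start, unsigned long end)
{
	unsigned long addr;

	for (addr = start; addr < end; addr += CACHE_LINE_SIZE) {
		/*
		 * Read then write back the same byte: the store forces the
		 * line to migrate to this CPU, since the cache maintenance
		 * below is not broadcast to the other cores.
		 */
		unsigned char tmp = *(volatile unsigned char *)addr;

		*(volatile unsigned char *)addr = tmp;
		/* local D-cache invalidate by MVA on the line we now own */
		asm volatile("mcr p15, 0, %0, c7, c6, 1" : : "r" (addr) : "memory");
	}
}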

But there's always a small window between read/write for ownership and
the actual cache maintenance which can cause a cache line to migrate to
other CPUs if they do speculative prefetches. At the time ARM11MPCore
was deemed safe-ish but I haven't followed what later implementations
actually did (luckily we fixed the architecture in ARMv7).

-- 
Catalin


Re: [PATCH 00/21] dma-mapping: unify support for cache flushes

2023-03-31 Thread Catalin Marinas
On Mon, Mar 27, 2023 at 02:12:56PM +0200, Arnd Bergmann wrote:
> Another difference that I do not address here is what cache invalidation
> does for partial cache lines. On arm32, arm64 and powerpc, a partial
> cache line always gets written back before invalidation in order to
> ensure that data before or after the buffer is not discarded. On all
> other architectures, the assumption is cache lines are never shared
> between DMA buffer and data that is accessed by the CPU.

I don't think sharing the DMA buffer with other data is safe even with
this clean+invalidate on the unaligned cache. Mapping the DMA buffer as
FROM_DEVICE or BIDIRECTIONAL can cause the shared cache line to be
evicted and override the device written data. This sharing only works if
the CPU guarantees not to dirty the corresponding cache line.

I'm fine with removing this partial cache line hack from arm64 as it's
not safe anyway. We'll see if any driver stops working. If there's some
benign sharing (I wouldn't trust it), the cache cleaning prior to
mapping and invalidate on unmap would not lose any data.
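
Just to make the unsafe layout concrete, a made-up example (struct and
field names are mine, not taken from any real driver):

#include <linux/types.h>

/* hypothetical layout, for illustration only */
struct dev_ctx {
	u8	rx_buf[60];	/* mapped with dma_map_single(..., DMA_FROM_DEVICE) */
	u32	irq_count;	/* written by the CPU while DMA is in flight */
};

If rx_buf and irq_count end up in the same cache line, the CPU write to
irq_count dirties that line and a later eviction (or the writeback part
of a clean+invalidate) can overwrite data the device has already placed
in rx_buf, which is exactly the sharing described above.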

-- 
Catalin


Re: [PATCH 03/14] arm64: reword ARCH_FORCE_MAX_ORDER prompt and help text

2023-03-23 Thread Catalin Marinas
On Thu, Mar 23, 2023 at 11:21:45AM +0200, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> The prompt and help text of ARCH_FORCE_MAX_ORDER are not even close to
> describe this configuration option.
> 
> Update both to actually describe what this option does.
> 
> Signed-off-by: Mike Rapoport (IBM) 

Acked-by: Catalin Marinas 


Re: [PATCH 02/14] arm64: drop ranges in definition of ARCH_FORCE_MAX_ORDER

2023-03-23 Thread Catalin Marinas
On Thu, Mar 23, 2023 at 11:21:44AM +0200, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> It is not a good idea to change fundamental parameters of core memory
> management. Having predefined ranges suggests that the values within
> those ranges are sensible, but one has to *really* understand
> implications of changing MAX_ORDER before actually amending it and
> ranges don't help here.
> 
> Drop ranges in definition of ARCH_FORCE_MAX_ORDER
> 
> Signed-off-by: Mike Rapoport (IBM) 
> ---
>  arch/arm64/Kconfig | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index e60baf7859d1..bab6483e4317 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -1489,9 +1489,7 @@ config XEN
>  config ARCH_FORCE_MAX_ORDER
>   int "Maximum zone order" if ARM64_4K_PAGES || ARM64_16K_PAGES
>   default "13" if ARM64_64K_PAGES
> - range 11 13 if ARM64_16K_PAGES
>   default "11" if ARM64_16K_PAGES
> - range 10 15 if ARM64_4K_PAGES
>   default "10"

I don't mind rewriting the help text as in the subsequent patch but I'd
keep the ranges as a safety measure. It's less wasted time explaining to
people why some random max order doesn't work. Alternatively, we can
drop the ranges but make this option configurable only if EXPERT.

-- 
Catalin


Re: [PATCH] mm: add PTE pointer parameter to flush_tlb_fix_spurious_fault()

2023-03-06 Thread Catalin Marinas
On Mon, Mar 06, 2023 at 05:15:48PM +0100, Gerald Schaefer wrote:
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index b6ba466e2e8a..0bd18de9fd97 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -57,7 +57,7 @@ static inline bool arch_thp_swp_supported(void)
>   * fault on one CPU which has been handled concurrently by another CPU
>   * does not need to perform additional invalidation.
>   */
> -#define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
> +#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)

For arm64:

Acked-by: Catalin Marinas 

> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 2c70b4d1263d..c1f6b46ec555 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1239,7 +1239,8 @@ static inline int pte_allow_rdp(pte_t old, pte_t new)
>  }
>  
>  static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
> - unsigned long address)
> + unsigned long address,
> + pte_t *ptep)
>  {
>   /*
>* RDP might not have propagated the PTE protection reset to all CPUs,
> @@ -1247,11 +1248,12 @@ static inline void 
> flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
>* NOTE: This will also be called when a racing pagetable update on
>* another thread already installed the correct PTE. Both cases cannot
>* really be distinguished.
> -  * Therefore, only do the local TLB flush when RDP can be used, to avoid
> -  * unnecessary overhead.
> +  * Therefore, only do the local TLB flush when RDP can be used, and the
> +  * PTE does not have _PAGE_PROTECT set, to avoid unnecessary overhead.
> +  * A local RDP can be used to do the flush.
>*/
> - if (MACHINE_HAS_RDP)
> - asm volatile("ptlb" : : : "memory");
> + if (MACHINE_HAS_RDP && !(pte_val(*ptep) & _PAGE_PROTECT))
> + __ptep_rdp(address, ptep, 0, 0, 1);

I wonder whether passing the actual entry is somewhat quicker as it
avoids another memory access (though it might already be in the cache).
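
Something like the below, completely untested and just to show the
shape I mean (callers would pass in the entry they already read, while
ptep is still needed for the RDP itself):

static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
						unsigned long address,
						pte_t *ptep, pte_t entry)
{
	if (MACHINE_HAS_RDP && !(pte_val(entry) & _PAGE_PROTECT))
		__ptep_rdp(address, ptep, 0, 0, 1);
}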

-- 
Catalin


Re: [PATCH v3 02/24] arm64: Remove COMMAND_LINE_SIZE from uapi

2023-02-14 Thread Catalin Marinas
On Tue, Feb 14, 2023 at 08:49:03AM +0100, Alexandre Ghiti wrote:
> From: Palmer Dabbelt 
> 
> As far as I can tell this is not used by userspace and thus should not
> be part of the user-visible API.
> 
> Signed-off-by: Palmer Dabbelt 

Acked-by: Catalin Marinas 


Re: (subset) [PATCH 00/35] Documentation: correct lots of spelling errors (series 1)

2023-01-31 Thread Catalin Marinas
On Thu, 26 Jan 2023 22:39:30 -0800, Randy Dunlap wrote:
> Correct many spelling errors in Documentation/ as reported by codespell.
> 
> Maintainers of specific kernel subsystems are only Cc-ed on their
> respective patches, not the entire series. [if all goes well]
> 
> These patches are based on linux-next-20230125.
> 
> [...]

Applied to arm64 (for-next/misc), thanks!

[01/35] Documentation: arm64: correct spelling
https://git.kernel.org/arm64/c/a70f00e7f1a3

-- 
Catalin



Re: [PATCH v7 2/2] arm64: support batched/deferred tlb shootdown during page reclamation

2023-01-09 Thread Catalin Marinas
On Sun, Jan 08, 2023 at 06:48:41PM +0800, Barry Song wrote:
> On Fri, Jan 6, 2023 at 2:15 AM Catalin Marinas  
> wrote:
> > On Thu, Nov 17, 2022 at 04:26:48PM +0800, Yicong Yang wrote:
> > > It is tested on 4,8,128 CPU platforms and shows to be beneficial on
> > > large systems but may not have improvement on small systems like on
> > > a 4 CPU platform. So make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depends
> > > on CONFIG_EXPERT for this stage and make this disabled on systems
> > > with less than 8 CPUs. User can modify this threshold according to
> > > their own platforms by CONFIG_NR_CPUS_FOR_BATCHED_TLB.
> >
> > What's the overhead of such batching on systems with 4 or fewer CPUs? If
> > it isn't noticeable, I'd rather have it always on than some number
> > chosen on whichever SoC you tested.
> 
> On the one hand, tlb flush is cheap on a small system. so batching tlb flush
> helps very minorly.

Yes, it probably won't help on small systems but I don't like config
options choosing the threshold, which may be different from system to
system even if they have the same number of CPUs. A run-time tunable
would be a better option.

> On the other hand, since we have batched the tlb flush, new PTEs might be
> invisible to others before the final broadcast is done and Ack-ed.

The new PTEs could indeed be invisible at the TLB level but not at the
memory (page table) level since this is done under the PTL IIUC.

> thus, there
> is a risk someone else might do mprotect or similar things  on those deferred
> pages which will ask for read-modify-write on those deferred PTEs.

And this should be fine, we have things like the PTL in place for the
actual memory access to the page table.

> in this
> case, mm will do an explicit flush by flush_tlb_batched_pending which is
> not required if tlb flush is not deferred.

I don't fully understand why it's needed, or at least why it would be
needed on arm64. At the end of an mprotect(), we have the final PTEs in
place and we just need to issue a TLBI for that range.
change_pte_range() for example has a tlb_flush_pte_range() if the PTE
was present and that won't be done lazily. If there are other TLBIs
pending for the same range, they'll be done later though likely
unnecessarily but still cheaper than issuing a flush_tlb_mm().

> void flush_tlb_batched_pending(struct mm_struct *mm)
> {
> 	int batch = atomic_read(&mm->tlb_flush_batched);
> 	int pending = batch & TLB_FLUSH_BATCH_PENDING_MASK;
> 	int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;
> 
> 	if (pending != flushed) {
> 		flush_tlb_mm(mm);
> 		/*
> 		 * If the new TLB flushing is pending during flushing, leave
> 		 * mm->tlb_flush_batched as is, to avoid losing flushing.
> 		 */
> 		atomic_cmpxchg(&mm->tlb_flush_batched, batch,
> 			       pending | (pending << TLB_FLUSH_BATCH_FLUSHED_SHIFT));
> 	}
> }

I guess this works on x86 better as it avoids the IPIs if this flush
already happened. But on arm64 we already issued the TLBI, we just
didn't wait for it to complete via a DSB.
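
To make that explicit, the arm64 side of such batching only needs to
defer the DSB, something along these lines (rough sketch, not the
actual patch; __flush_tlb_page_nosync() is assumed to be a TLBI helper
without the trailing DSB):

static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
					struct mm_struct *mm,
					unsigned long uaddr)
{
	/* issue the TLBI now, don't wait for completion */
	__flush_tlb_page_nosync(mm, uaddr);
}

static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
	/* wait for all previously issued TLBIs to complete */
	dsb(ish);
}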

> I believe Anshuman has contributed many points on this in those previous
> discussions.

Yeah, I should re-read the old threads.

-- 
Catalin


Re: [PATCH v7 2/2] arm64: support batched/deferred tlb shootdown during page reclamation

2023-01-05 Thread Catalin Marinas
On Thu, Nov 17, 2022 at 04:26:48PM +0800, Yicong Yang wrote:
> It is tested on 4,8,128 CPU platforms and shows to be beneficial on
> large systems but may not have improvement on small systems like on
> a 4 CPU platform. So make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depends
> on CONFIG_EXPERT for this stage and make this disabled on systems
> with less than 8 CPUs. User can modify this threshold according to
> their own platforms by CONFIG_NR_CPUS_FOR_BATCHED_TLB.

What's the overhead of such batching on systems with 4 or fewer CPUs? If
it isn't noticeable, I'd rather have it always on than some number
chosen on whichever SoC you tested.

Another option would be to make this a sysctl tunable.

>  .../features/vm/TLB/arch-support.txt  |  2 +-
>  arch/arm64/Kconfig|  6 +++
>  arch/arm64/include/asm/tlbbatch.h | 12 +
>  arch/arm64/include/asm/tlbflush.h | 52 ++-
>  arch/x86/include/asm/tlbflush.h   |  5 +-
>  include/linux/mm_types_task.h |  4 +-
>  mm/rmap.c | 10 ++--

Please keep any function prototype changes in a preparatory patch so
that the arm64 one only introduces the arch specific changes. Easier to
review.

> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> +{
> + /*
> +  * TLB batched flush is proved to be beneficial for systems with large
> +  * number of CPUs, especially systems with more than 8 CPUs. TLB shootdown
> +  * is cheap on small systems which may not need this feature. So use
> +  * a threshold for enabling this to avoid potential side effects on
> +  * these platforms.
> +  */
> + if (num_online_cpus() < CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB)
> + return false;

The x86 implementation tracks the cpumask of where a task has run. We
don't have such tracking on arm64 and I don't think it matters. As
noticed/described in this series, the bottleneck is the actual DSB
synchronisation (which sends a DVM Sync message to all the other CPUs
and waits for a DVM Complete response). So I think it makes sense not to
bother with an mm_cpumask(). What this patch aims to optimise is
actually the number of DSBs issued on an SMP system by
ptep_clear_flush().

The DVM is not an architected concept (well, it's part of AMBA AXI). I'd
be curious to know how such patch behaves on Apple's M1/M2 hardware. My
preference would be to have this always on for num_online_cpus() > 1 if
there's no overhead.
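
In other words, something as simple as this (illustrative only):

static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
{
	/* no Kconfig threshold: batch whenever another CPU needs the TLBI */
	return num_online_cpus() > 1;
}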

-- 
Catalin


Re: [PATCH v1 2/2] stackprotector: actually use get_random_canary()

2022-11-09 Thread Catalin Marinas
On Sun, Oct 23, 2022 at 10:32:08PM +0200, Jason A. Donenfeld wrote:
> The RNG always mixes in the Linux version extremely early in boot. It
> also always includes a cycle counter, not only during early boot, but
> each and every time it is invoked prior to being fully initialized.
> Together, this means that the use of additional xors inside of the
> various stackprotector.h files is superfluous and over-complicated.
> Instead, we can get exactly the same thing, but better, by just calling
> `get_random_canary()`.
> 
> Signed-off-by: Jason A. Donenfeld 
> ---
>  arch/arm/include/asm/stackprotector.h |  9 +
>  arch/arm64/include/asm/stackprotector.h   |  9 +----

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH] mm: remove kern_addr_valid() completely

2022-11-02 Thread Catalin Marinas
On Tue, Oct 18, 2022 at 03:40:14PM +0800, Kefeng Wang wrote:
> Most architectures (except arm64/x86/sparc) simply return 1 for
> kern_addr_valid(), which is only used in read_kcore(), and that in
> turn calls copy_from_kernel_nofault(), which already checks whether
> the address is a valid kernel address. So kern_addr_valid() is not
> needed; remove it completely.
> 
> Signed-off-by: Kefeng Wang 

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH] kernel: exit: cleanup release_thread()

2022-08-21 Thread Catalin Marinas
On Fri, Aug 19, 2022 at 09:44:06AM +0800, Kefeng Wang wrote:
> Only x86 has own release_thread(), introduce a new weak
> release_thread() function to clean empty definitions in
> other ARCHs.
> 
> Signed-off-by: Kefeng Wang 
[...]
>  arch/arm64/include/asm/processor.h  | 3 ---
>  arch/arm64/kernel/process.c | 4 ----

Acked-by: Catalin Marinas 


Re: [PATCH] arch: mm: rename FORCE_MAX_ZONEORDER to ARCH_FORCE_MAX_ORDER

2022-08-16 Thread Catalin Marinas
On Mon, Aug 15, 2022 at 10:39:59AM -0400, Zi Yan wrote:
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 571cc234d0b3..c6fcd8746f60 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -1401,7 +1401,7 @@ config XEN
>   help
> Say Y if you want to run Linux in a Virtual Machine on Xen on ARM64.
>  
> -config FORCE_MAX_ZONEORDER
> +config ARCH_FORCE_MAX_ORDER
>   int
>   default "14" if ARM64_64K_PAGES
>   default "12" if ARM64_16K_PAGES

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH v5] random: remove CONFIG_ARCH_RANDOM

2022-07-13 Thread Catalin Marinas
On Fri, Jul 08, 2022 at 02:40:32AM +0200, Jason A. Donenfeld wrote:
> When RDRAND was introduced, there was much discussion on whether it
> should be trusted and how the kernel should handle that. Initially, two
> mechanisms cropped up, CONFIG_ARCH_RANDOM, a compile time switch, and
> "nordrand", a boot-time switch.
> 
> Later the thinking evolved. With a properly designed RNG, using RDRAND
> values alone won't harm anything, even if the outputs are malicious.
> Rather, the issue is whether those values are being *trusted* to be good
> or not. And so a new set of options were introduced as the real
> ones that people use -- CONFIG_RANDOM_TRUST_CPU and "random.trust_cpu".
> With these options, RDRAND is used, but it's not always credited. So in
> the worst case, it does nothing, and in the best case, maybe it helps.
> 
> Along the way, CONFIG_ARCH_RANDOM's meaning got sort of pulled into the
> center and became something certain platforms force-select.
> 
> The old options don't really help with much, and it's a bit odd to have
> special handling for these instructions when the kernel can deal fine
> with the existence or untrusted existence or broken existence or
> non-existence of that CPU capability.
> 
> Simplify the situation by removing CONFIG_ARCH_RANDOM and using the
> ordinary asm-generic fallback pattern instead, keeping the two options
> that are actually used. It leaves "nordrand" in place for now, as the
> removal of that will take a different route.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Michael Ellerman 
> Cc: Alexander Gordeev 
> Cc: Thomas Gleixner 
> Cc: H. Peter Anvin 
> Acked-by: Borislav Petkov 
> Acked-by: Heiko Carstens 
> Acked-by: Greg Kroah-Hartman 
> Signed-off-by: Jason A. Donenfeld 

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH V4 05/26] arm64/mm: Move protection_map[] inside the platform

2022-06-24 Thread Catalin Marinas
On Fri, Jun 24, 2022 at 10:13:18AM +0530, Anshuman Khandual wrote:
> This moves protection_map[] inside the platform and makes it a static.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Anshuman Khandual 

Reviewed-by: Catalin Marinas 


Re: [PATCH v4] mm: Avoid unnecessary page fault retires on shared memory types

2022-05-30 Thread Catalin Marinas
On Fri, May 27, 2022 at 03:39:36PM -0400, Peter Xu wrote:
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 77341b160aca..e401d416bbd6 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -607,6 +607,10 @@ static int __kprobes do_page_fault(unsigned long far, 
> unsigned int esr,
>   return 0;
>   }
>  
> + /* The fault is fully completed (including releasing mmap lock) */
> + if (fault & VM_FAULT_COMPLETED)
> + return 0;
> +
>   if (fault & VM_FAULT_RETRY) {
>   mm_flags |= FAULT_FLAG_TRIED;
>   goto retry;

For arm64:

Acked-by: Catalin Marinas 


Re: (subset) [PATCH -next v4 0/7]arm64: add machine check safe support

2022-05-16 Thread Catalin Marinas
On Wed, 20 Apr 2022 03:04:11 +, Tong Tiangen wrote:
> With the increase of memory capacity and density, the probability of
> memory error increases. The increasing size and density of server RAM
> in the data center and cloud have shown increased uncorrectable memory
> errors.
> 
> Currently, the kernel has a mechanism to recover from hardware memory
> errors. This patchset provides a new recovery mechanism.
> 
> [...]

Applied to arm64 (for-next/misc), thanks!

[5/7] arm64: mte: Clean up user tag accessors
  https://git.kernel.org/arm64/c/b4d6bb38f9dc

-- 
Catalin



Re: [PATCH] bug: Use normal relative pointers in 'struct bug_entry'

2022-05-10 Thread Catalin Marinas
On Thu, May 05, 2022 at 06:09:45PM -0700, Josh Poimboeuf wrote:
> With CONFIG_GENERIC_BUG_RELATIVE_POINTERS, the addr/file relative
> pointers are calculated weirdly: based on the beginning of the bug_entry
> struct address, rather than their respective pointer addresses.
> 
> Make the relative pointers less surprising to both humans and tools by
> calculating them the normal way.
> 
> Signed-off-by: Josh Poimboeuf 
> ---
>  arch/arm64/include/asm/asm-bug.h |  4 ++--

Acked-by: Catalin Marinas 


Re: [PATCH -next v4 4/7] arm64: add copy_{to, from}_user to machine check safe

2022-05-05 Thread Catalin Marinas
On Thu, May 05, 2022 at 02:39:43PM +0800, Tong Tiangen wrote:
> 在 2022/5/4 18:26, Catalin Marinas 写道:
> > On Wed, Apr 20, 2022 at 03:04:15AM +, Tong Tiangen wrote:
> > > Add copy_{to, from}_user() to machine check safe.
> > > 
> > > If copy fail due to hardware memory error, only the relevant processes are
> > > affected, so killing the user process and isolate the user page with
> > > hardware memory errors is a more reasonable choice than kernel panic.
> > 
> > Just to make sure I understand - we can only recover if the fault is in
> > a user page. That is, for a copy_from_user(), we can only handle the
> > faults in the source address, not the destination.
> 
> At the beginning, I also thought we can only recover if the fault is in a
> user page.
> After discussion with a Mark[1], I think no matter user page or kernel page,
> as long as it is triggered by the user process, only related processes will
> be affected. According to this
> understanding, it seems that all uaccess can be recovered.
> 
> [1]https://patchwork.kernel.org/project/linux-arm-kernel/patch/20220406091311.3354723-6-tongtian...@huawei.com/

We can indeed safely skip this copy and return an error just like
pretending there was a user page fault. However, my point was more
around the "isolate the user page with hardware memory errors". If the
fault is on a kernel address, there's not much you can do about. You'll
likely trigger it later when you try to access that address (maybe it
was freed and re-allocated). Do we hope we won't get the same error
again on that kernel address?

-- 
Catalin


Re: (subset) [PATCH -next v4 0/7]arm64: add machine check safe support

2022-05-04 Thread Catalin Marinas
On Wed, 20 Apr 2022 03:04:11 +, Tong Tiangen wrote:
> With the increase of memory capacity and density, the probability of
> memory error increases. The increasing size and density of server RAM
> in the data center and cloud have shown increased uncorrectable memory
> errors.
> 
> Currently, the kernel has a mechanism to recover from hardware memory
> errors. This patchset provides a new recovery mechanism.
> 
> [...]

Applied to arm64 (for-next/misc), thanks!

[2/7] arm64: fix types in copy_highpage()
  https://git.kernel.org/arm64/c/921d161f15d6

-- 
Catalin



Re: [PATCH -next v4 4/7] arm64: add copy_{to, from}_user to machine check safe

2022-05-04 Thread Catalin Marinas
On Wed, Apr 20, 2022 at 03:04:15AM +, Tong Tiangen wrote:
> Add copy_{to, from}_user() to machine check safe.
> 
> If copy fail due to hardware memory error, only the relevant processes are
> affected, so killing the user process and isolate the user page with
> hardware memory errors is a more reasonable choice than kernel panic.

Just to make sure I understand - we can only recover if the fault is in
a user page. That is, for a copy_from_user(), we can only handle the
faults in the source address, not the destination.

> diff --git a/arch/arm64/lib/copy_from_user.S b/arch/arm64/lib/copy_from_user.S
> index 34e317907524..480cc5ac0a8d 100644
> --- a/arch/arm64/lib/copy_from_user.S
> +++ b/arch/arm64/lib/copy_from_user.S
> @@ -25,7 +25,7 @@
>   .endm
>  
>   .macro strb1 reg, ptr, val
> - strb \reg, [\ptr], \val
> + USER_MC(9998f, strb \reg, [\ptr], \val)
>   .endm

So if I got the above correctly, why do we need an exception table entry
for the store to the kernel address?

-- 
Catalin


Re: [PATCH V4 7/7] mm/mmap: Drop arch_vm_get_page_pgprot()

2022-04-08 Thread Catalin Marinas
On Thu, Apr 07, 2022 at 04:02:51PM +0530, Anshuman Khandual wrote:
> There are no platforms left which use arch_vm_get_page_prot(). Just drop
> generic arch_vm_get_page_prot().
> 
> Cc: Andrew Morton 
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Anshuman Khandual 

Reviewed-by: Catalin Marinas 


Re: [PATCH V4 6/7] mm/mmap: Drop arch_filter_pgprot()

2022-04-08 Thread Catalin Marinas
On Thu, Apr 07, 2022 at 04:02:50PM +0530, Anshuman Khandual wrote:
> There are no platforms left which subscribe ARCH_HAS_FILTER_PGPROT. Hence
> drop generic arch_filter_pgprot() and also config ARCH_HAS_FILTER_PGPROT.
> 
> Cc: Andrew Morton 
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Anshuman Khandual 

Reviewed-by: Catalin Marinas 


Re: [PATCH V4 3/7] arm64/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

2022-04-08 Thread Catalin Marinas
On Thu, Apr 07, 2022 at 04:02:47PM +0530, Anshuman Khandual wrote:
> diff --git a/arch/arm64/mm/mmap.c b/arch/arm64/mm/mmap.c
> index 77ada00280d9..307534fcec00 100644
> --- a/arch/arm64/mm/mmap.c
> +++ b/arch/arm64/mm/mmap.c
> @@ -55,3 +55,36 @@ static int __init adjust_protection_map(void)
>   return 0;
>  }
>  arch_initcall(adjust_protection_map);
> +
> +static pgprot_t arm64_arch_vm_get_page_prot(unsigned long vm_flags)
> +{
> + pteval_t prot = 0;
> +
> + if (vm_flags & VM_ARM64_BTI)
> + prot |= PTE_GP;
> +
> + /*
> +  * There are two conditions required for returning a Normal Tagged
> +  * memory type: (1) the user requested it via PROT_MTE passed to
> +  * mmap() or mprotect() and (2) the corresponding vma supports MTE. We
> +  * register (1) as VM_MTE in the vma->vm_flags and (2) as
> +  * VM_MTE_ALLOWED. Note that the latter can only be set during the
> +  * mmap() call since mprotect() does not accept MAP_* flags.
> +  * Checking for VM_MTE only is sufficient since arch_validate_flags()
> +  * does not permit (VM_MTE & !VM_MTE_ALLOWED).
> +  */
> + if (vm_flags & VM_MTE)
> + prot |= PTE_ATTRINDX(MT_NORMAL_TAGGED);
> +
> + return __pgprot(prot);
> +}
> +
> +pgprot_t vm_get_page_prot(unsigned long vm_flags)
> +{
> +	pgprot_t ret = __pgprot(pgprot_val(protection_map[vm_flags &
> +			(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]) |
> +			pgprot_val(arm64_arch_vm_get_page_prot(vm_flags)));
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(vm_get_page_prot);

Could you write all this in a single function? I think I mentioned it in
a previous series (untested):

pgprot_t vm_get_page_prot(unsigned long vm_flags)
{
pteval_t prot = pgprot_val(protection_map[vm_flags &
   (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]);

if (vm_flags & VM_ARM64_BTI)
prot |= PTE_GP;

/*
 * There are two conditions required for returning a Normal Tagged
 * memory type: (1) the user requested it via PROT_MTE passed to
 * mmap() or mprotect() and (2) the corresponding vma supports MTE. We
 * register (1) as VM_MTE in the vma->vm_flags and (2) as
 * VM_MTE_ALLOWED. Note that the latter can only be set during the
 * mmap() call since mprotect() does not accept MAP_* flags.
 * Checking for VM_MTE only is sufficient since arch_validate_flags()
 * does not permit (VM_MTE & !VM_MTE_ALLOWED).
 */
if (vm_flags & VM_MTE)
prot |= PTE_ATTRINDX(MT_NORMAL_TAGGED);

return __pgprot(prot);
}
EXPORT_SYMBOL(vm_get_page_prot);

With that:

Reviewed-by: Catalin Marinas 


Re: [PATCH V4 1/7] mm/mmap: Add new config ARCH_HAS_VM_GET_PAGE_PROT

2022-04-08 Thread Catalin Marinas
On Thu, Apr 07, 2022 at 04:02:45PM +0530, Anshuman Khandual wrote:
> Add a new config ARCH_HAS_VM_GET_PAGE_PROT, which when subscribed enables a
> given platform to define its own vm_get_page_prot() but still utilizing the
> generic protection_map[] array.
> 
> Cc: Andrew Morton 
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> Suggested-by: Christoph Hellwig 
> Signed-off-by: Anshuman Khandual 

Reviewed-by: Catalin Marinas 


Re: False positive kmemleak report for dtb properties names on powerpc

2022-03-23 Thread Catalin Marinas
Hi Ariel,

On Fri, Feb 18, 2022 at 09:45:51PM +0200, Ariel Marcovitch wrote:
> I was running a powerpc 32bit kernel (built using
> qemu_ppc_mpc8544ds_defconfig
> buildroot config, with enabling DEBUGFS+KMEMLEAK+HIGHMEM in the kernel
> config)
> on qemu and invoked the kmemleak scan (twice; for some reason the first time
> wasn't enough).
[...]
> I got 97 leak reports, all similar to the following:
[...]
> memblock_alloc lets kmemleak know about the allocated memory using
> kmemleak_alloc_phys (in mm/memblock.c:memblock_alloc_range_nid()).
> 
> The problem is with the following code (mm/kmemleak.c):
> 
> ```c
> 
> void __ref kmemleak_alloc_phys(phys_addr_t phys, size_t size, int min_count,
>    gfp_t gfp)
> {
>     if (!IS_ENABLED(CONFIG_HIGHMEM) || PHYS_PFN(phys) < max_low_pfn)
>     kmemleak_alloc(__va(phys), size, min_count, gfp);
> }
> 
> ```
> 
> When CONFIG_HIGHMEM is enabled, the pfn of the allocated memory is checked
> against max_low_pfn, to make sure it is not in the HIGHMEM zone.
> 
> However, when called through unflatten_device_tree(), max_low_pfn is not yet
> initialized in powerpc.
> 
> max_low_pfn is initialized (when NUMA is disabled) in
> arch/powerpc/mm/mem.c:mem_topology_setup() which is called only after
> unflatten_device_tree() is called in the same function (setup_arch()).
> 
> Because max_low_pfn is global it is 0 before initialization, so as far as
> kmemleak_alloc_phys() is concerned, every memory is HIGHMEM (: and the
> allocated memory is not tracked by kmemleak, causing references to objects
> allocated later with kmalloc() to be ignored and these objects are marked as
> leaked.

Thanks for digging into this. It looks like the logic doesn't work (not
sure whether it ever worked).

> I actually tried to find out whether this happen on other arches as well,
> and it seems like arm64 also have this problem when dtb is used instead of
> acpi, although I haven't had the chance to confirm this.

arm64 doesn't enable CONFIG_HIGHMEM, so it's not affected.

> I don't suppose I can just shuffle the calls in setup_arch() around, so I
> wanted to hear your opinions first

I think it's better if we change the logic than shuffling the calls.
IIUC MEMBLOCK_ALLOC_ACCESSIBLE means that __va() works on the phys
address returned by memblock, so something like below (untested):

-8<--
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 7580baa76af1..f3599e857c13 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1127,8 +1127,7 @@ EXPORT_SYMBOL(kmemleak_no_scan);
 void __ref kmemleak_alloc_phys(phys_addr_t phys, size_t size, int min_count,
   gfp_t gfp)
 {
-   if (!IS_ENABLED(CONFIG_HIGHMEM) || PHYS_PFN(phys) < max_low_pfn)
-   kmemleak_alloc(__va(phys), size, min_count, gfp);
+   kmemleak_alloc(__va(phys), size, min_count, gfp);
 }
 EXPORT_SYMBOL(kmemleak_alloc_phys);
 
diff --git a/mm/memblock.c b/mm/memblock.c
index b12a364f2766..2515bd4331e8 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1397,7 +1397,7 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t 
size,
 * Skip kmemleak for those places like kasan_init() and
 * early_pgtable_alloc() due to high volume.
 */
-   if (end != MEMBLOCK_ALLOC_NOLEAKTRACE)
+   if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
/*
 * The min_count is set to 0 so that memblock allocated
 * blocks are never reported as leaks. This is because many
-8<--

But I'm not sure whether we'd now miss some genuine allocations where
the memblock limit is different from MEMBLOCK_ALLOC_ACCESSIBLE but still
within the lowmem limit. If the above works, we can probably get rid of
some other kmemleak callbacks in the kernel.

Adding Mike for any input on memblock.

-- 
Catalin


Re: [PATCH v1 4/7] arm64/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE

2022-03-21 Thread Catalin Marinas
On Mon, Mar 21, 2022 at 05:44:05PM +, Will Deacon wrote:
> On Mon, Mar 21, 2022 at 04:07:48PM +0100, David Hildenbrand wrote:
> > So the example you gave cannot possibly have that bit set. From what I
> > understand, it should be fine. But I have no real preference: I can also
> > just stick to the original patch, whatever you prefer.
> 
> I think I'd prefer to stay on the safe side and stick with bit 2 as you
> originally proposed. If we need to support crazy numbers of swapfiles
> in future then we can revisit the idea of allocating bit 1.

Sounds fine to me. David, feel free to keep my reviewed-by on the
original patch.

-- 
Catalin


Re: [PATCH v1 4/7] arm64/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE

2022-03-18 Thread Catalin Marinas
On Fri, Mar 18, 2022 at 10:59:10AM +0100, David Hildenbrand wrote:
> diff --git a/arch/arm64/include/asm/pgtable-prot.h 
> b/arch/arm64/include/asm/pgtable-prot.h
> index b1e1b74d993c..fd6ddf14c190 100644
> --- a/arch/arm64/include/asm/pgtable-prot.h
> +++ b/arch/arm64/include/asm/pgtable-prot.h
> @@ -14,6 +14,7 @@
>   * Software defined PTE bits definition.
>   */
>  #define PTE_WRITE		(PTE_DBM)		/* same as DBM (51) */
> +#define PTE_SWP_EXCLUSIVE	(PTE_TABLE_BIT)		/* only for swp ptes */
>  #define PTE_DIRTY		(_AT(pteval_t, 1) << 55)
>  #define PTE_SPECIAL		(_AT(pteval_t, 1) << 56)
>  #define PTE_DEVMAP		(_AT(pteval_t, 1) << 57)
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 94e147e5456c..c78994073cd0 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -402,6 +402,22 @@ static inline pgprot_t mk_pmd_sect_prot(pgprot_t prot)
>   return __pgprot((pgprot_val(prot) & ~PMD_TABLE_BIT) | PMD_TYPE_SECT);
>  }
>  
> +#define __HAVE_ARCH_PTE_SWP_EXCLUSIVE
> +static inline pte_t pte_swp_mkexclusive(pte_t pte)
> +{
> + return set_pte_bit(pte, __pgprot(PTE_SWP_EXCLUSIVE));
> +}
> +
> +static inline int pte_swp_exclusive(pte_t pte)
> +{
> + return pte_val(pte) & PTE_SWP_EXCLUSIVE;
> +}
> +
> +static inline pte_t pte_swp_clear_exclusive(pte_t pte)
> +{
> + return clear_pte_bit(pte, __pgprot(PTE_SWP_EXCLUSIVE));
> +}
> +
>  #ifdef CONFIG_NUMA_BALANCING
>  /*
>   * See the comment in include/linux/pgtable.h
> @@ -908,7 +924,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct 
> *vma,
>  
>  /*
>   * Encode and decode a swap entry:
> - *   bits 0-1:   present (must be zero)
> + *   bits 0: present (must be zero)
> + *   bits 1: remember PG_anon_exclusive

It looks fine to me.

Reviewed-by: Catalin Marinas 


Re: [PATCH v1 4/7] arm64/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE

2022-03-17 Thread Catalin Marinas
On Thu, Mar 17, 2022 at 11:04:18AM +0100, David Hildenbrand wrote:
> On 16.03.22 19:27, Catalin Marinas wrote:
> > On Tue, Mar 15, 2022 at 03:18:34PM +0100, David Hildenbrand wrote:
> >> @@ -909,12 +925,13 @@ static inline pmd_t pmdp_establish(struct 
> >> vm_area_struct *vma,
> >>  /*
> >>   * Encode and decode a swap entry:
> >>   *bits 0-1:   present (must be zero)
> >> - *bits 2-7:   swap type
> >> + *bits 2: remember PG_anon_exclusive
> >> + *bits 3-7:   swap type
> >>   *bits 8-57:  swap offset
> >>   *bit  58:PTE_PROT_NONE (must be zero)
> > 
> > I don't remember exactly why we reserved bits 0 and 1 when, from the
> > hardware perspective, it's sufficient for bit 0 to be 0 and the whole
> > pte becomes invalid. We use bit 1 as the 'table' bit (when 0 at pmd
> > level, it's a huge page) but we shouldn't check for this on a swap
> > entry.
> 
> You mean
> 
> arch/arm64/include/asm/pgtable-hwdef.h:#define PTE_TABLE_BIT
> (_AT(pteval_t, 1) << 1)
> 
> right?

Yes.

> I wonder why it even exists, for arm64 I only spot:
> 
> arch/arm64/include/asm/pgtable.h:#define pte_mkhuge(pte)
> (__pte(pte_val(pte) & ~PTE_TABLE_BIT))
> 
> I don't really see code that sets PTE_TABLE_BIT.
> 
> Similarly, I don't see code that sets 
> PMD_TABLE_BIT/PUD_TABLE_BIT/P4D_TABLE_BIT.
> Most probably setting code is not using the defines,  that's why I'm not 
> finding it.

It gets set as part of P*D_TYPE_TABLE via p*d_populate(). We use the
P*D_TABLE_BIT mostly for checking whether it's a huge page or not (the
arm64 hugetlbpage.c code).
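
For reference, the relevant definitions in
arch/arm64/include/asm/pgtable-hwdef.h are roughly:

#define PMD_TYPE_MASK		(_AT(pmdval_t, 3) << 0)
#define PMD_TYPE_TABLE		(_AT(pmdval_t, 3) << 0)	/* bits 1:0 == 0b11 */
#define PMD_TYPE_SECT		(_AT(pmdval_t, 1) << 0)	/* bits 1:0 == 0b01 */
#define PMD_TABLE_BIT		(_AT(pmdval_t, 1) << 1)

so the table bit is just bit 1 of the descriptor type field and gets set
implicitly whenever P*D_TYPE_TABLE is written, which is why you won't
find it being set by name.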

-- 
Catalin


Re: [PATCH v1 4/7] arm64/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE

2022-03-16 Thread Catalin Marinas
On Tue, Mar 15, 2022 at 03:18:34PM +0100, David Hildenbrand wrote:
> diff --git a/arch/arm64/include/asm/pgtable-prot.h 
> b/arch/arm64/include/asm/pgtable-prot.h
> index b1e1b74d993c..62e0ebeed720 100644
> --- a/arch/arm64/include/asm/pgtable-prot.h
> +++ b/arch/arm64/include/asm/pgtable-prot.h
> @@ -14,6 +14,7 @@
>   * Software defined PTE bits definition.
>   */
>  #define PTE_WRITE		(PTE_DBM)		/* same as DBM (51) */
> +#define PTE_SWP_EXCLUSIVE	(_AT(pteval_t, 1) << 2)	/* only for swp ptes */

I think we can use bit 1 here.
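
i.e. something like (untested):

#define PTE_SWP_EXCLUSIVE	(PTE_TABLE_BIT)	/* bit 1, only for swp ptes */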

> @@ -909,12 +925,13 @@ static inline pmd_t pmdp_establish(struct 
> vm_area_struct *vma,
>  /*
>   * Encode and decode a swap entry:
>   *   bits 0-1:   present (must be zero)
> - *   bits 2-7:   swap type
> + *   bits 2: remember PG_anon_exclusive
> + *   bits 3-7:   swap type
>   *   bits 8-57:  swap offset
>   *   bit  58:PTE_PROT_NONE (must be zero)

I don't remember exactly why we reserved bits 0 and 1 when, from the
hardware perspective, it's sufficient for bit 0 to be 0 and the whole
pte becomes invalid. We use bit 1 as the 'table' bit (when 0 at pmd
level, it's a huge page) but we shouldn't check for this on a swap
entry.

-- 
Catalin


Re: [PATCH V3 05/30] arm64/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

2022-03-03 Thread Catalin Marinas
Hi Anshuman,

On Mon, Feb 28, 2022 at 04:17:28PM +0530, Anshuman Khandual wrote:
> +static inline pgprot_t __vm_get_page_prot(unsigned long vm_flags)
> +{
> + switch (vm_flags & (VM_READ | VM_WRITE | VM_EXEC | VM_SHARED)) {
> + case VM_NONE:
> + return PAGE_NONE;
> + case VM_READ:
> + case VM_WRITE:
> + case VM_WRITE | VM_READ:
> + return PAGE_READONLY;
> + case VM_EXEC:
> + return PAGE_EXECONLY;
> + case VM_EXEC | VM_READ:
> + case VM_EXEC | VM_WRITE:
> + case VM_EXEC | VM_WRITE | VM_READ:
> + return PAGE_READONLY_EXEC;
> + case VM_SHARED:
> + return PAGE_NONE;
> + case VM_SHARED | VM_READ:
> + return PAGE_READONLY;
> + case VM_SHARED | VM_WRITE:
> + case VM_SHARED | VM_WRITE | VM_READ:
> + return PAGE_SHARED;
> + case VM_SHARED | VM_EXEC:
> + return PAGE_EXECONLY;
> + case VM_SHARED | VM_EXEC | VM_READ:
> + return PAGE_READONLY_EXEC;
> + case VM_SHARED | VM_EXEC | VM_WRITE:
> + case VM_SHARED | VM_EXEC | VM_WRITE | VM_READ:
> + return PAGE_SHARED_EXEC;
> + default:
> + BUILD_BUG();
> + }
> +}

I'd say ack for trying to get of the extra arch_vm_get_page_prot() and
arch_filter_pgprot() but, TBH, I'm not so keen on the outcome. I haven't
built the code to see what's generated but I suspect it's no significant
improvement. As for the code readability, the arm64 parts don't look
much better either. The only advantage with this patch is that all
functions have been moved under arch/arm64.

I'd keep most architectures that don't have own arch_vm_get_page_prot()
or arch_filter_pgprot() unchanged and with a generic protection_map[]
array. For architectures that need fancier stuff, add a
CONFIG_ARCH_HAS_VM_GET_PAGE_PROT (as you do) and allow them to define
vm_get_page_prot() while getting rid of arch_vm_get_page_prot() and
arch_filter_pgprot(). I think you could also duplicate protection_map[]
for architectures with own vm_get_page_prot() (make it static) and
#ifdef it out in mm/mmap.c.

If later you have more complex needs or a switch statement generates
better code, go for it, but for this series I'd keep things simple, only
focus on getting rid of arch_vm_get_page_prot() and
arch_filter_pgprot().

If I grep'ed correctly, there are only 4 architectures that have own
arch_vm_get_page_prot() (arm64, powerpc, sparc, x86) and 2 that have own
arch_filter_pgprot() (arm64, x86). Try to only change these for the time
being, together with the other generic mm cleanups you have in this
series. I think there are a couple more that touch protection_map[]
(arm, m68k). You can leave the generic protection_map[] global if the
arch does not select ARCH_HAS_VM_GET_PAGE_PROT.
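
Roughly, for the generic side (sketch only, untested):

/* mm/mmap.c */
#ifndef CONFIG_ARCH_HAS_VM_GET_PAGE_PROT
pgprot_t protection_map[16] __ro_after_init = {
	__P000, __P001, __P010, __P011, __P100, __P101, __P110, __P111,
	__S000, __S001, __S010, __S011, __S100, __S101, __S110, __S111
};

pgprot_t vm_get_page_prot(unsigned long vm_flags)
{
	return protection_map[vm_flags &
			      (VM_READ | VM_WRITE | VM_EXEC | VM_SHARED)];
}
EXPORT_SYMBOL(vm_get_page_prot);
#endif	/* CONFIG_ARCH_HAS_VM_GET_PAGE_PROT */

and the architectures that select it provide their own (static)
protection_map[] and vm_get_page_prot().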

> +static pgprot_t arm64_arch_filter_pgprot(pgprot_t prot)
> +{
> + if (cpus_have_const_cap(ARM64_HAS_EPAN))
> + return prot;
> +
> + if (pgprot_val(prot) != pgprot_val(PAGE_EXECONLY))
> + return prot;
> +
> + return PAGE_READONLY_EXEC;
> +}
> +
> +static pgprot_t arm64_arch_vm_get_page_prot(unsigned long vm_flags)
> +{
> + pteval_t prot = 0;
> +
> + if (vm_flags & VM_ARM64_BTI)
> + prot |= PTE_GP;
> +
> + /*
> +  * There are two conditions required for returning a Normal Tagged
> +  * memory type: (1) the user requested it via PROT_MTE passed to
> +  * mmap() or mprotect() and (2) the corresponding vma supports MTE. We
> +  * register (1) as VM_MTE in the vma->vm_flags and (2) as
> +  * VM_MTE_ALLOWED. Note that the latter can only be set during the
> +  * mmap() call since mprotect() does not accept MAP_* flags.
> +  * Checking for VM_MTE only is sufficient since arch_validate_flags()
> +  * does not permit (VM_MTE & !VM_MTE_ALLOWED).
> +  */
> + if (vm_flags & VM_MTE)
> + prot |= PTE_ATTRINDX(MT_NORMAL_TAGGED);
> +
> + return __pgprot(prot);
> +}
> +
> +pgprot_t vm_get_page_prot(unsigned long vm_flags)
> +{
> + pgprot_t ret = __pgprot(pgprot_val(__vm_get_page_prot(vm_flags)) |
> + pgprot_val(arm64_arch_vm_get_page_prot(vm_flags)));
> +
> + return arm64_arch_filter_pgprot(ret);
> +}

If we kept the array, we can have everything in a single function
(untested and with my own comments for future changes):

pgprot_t vm_get_page_prot(unsigned long vm_flags)
{
pgprot_t prot = __pgprot(pgprot_val(protection_map[vm_flags &
(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]));

/*
 * We could get rid of this test if we updated protection_map[]
 * to turn exec-only into read-exec during boot.
 */
if (!cpus_have_const_cap(ARM64_HAS_EPAN) &&
pgprot_val(prot) == pgprot_val(PAGE_EXECONLY))
prot = PAGE_READONLY_EXEC;

if (vm_flags & VM_ARM64_BTI)
	prot |= PTE_GP;

/*
 * We can get rid of the 

Re: [PATCH V2] mm: Merge pte_mkhuge() call into arch_make_huge_pte()

2022-02-03 Thread Catalin Marinas
On Thu, Feb 03, 2022 at 09:27:49AM +0530, Anshuman Khandual wrote:
> Each call into pte_mkhuge() is invariably followed by arch_make_huge_pte().
> Instead arch_make_huge_pte() can accommodate pte_mkhuge() at the beginning.
> This updates generic fallback stub for arch_make_huge_pte() and available
> platforms definitions. This makes huge pte creation much cleaner and easier
> to follow.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Michael Ellerman 
> Cc: Paul Mackerras 
> Cc: "David S. Miller" 
> Cc: Mike Kravetz 
> Cc: Andrew Morton 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: sparcli...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> Reviewed-by: Christophe Leroy 
> Acked-by: Mike Kravetz 
> Signed-off-by: Anshuman Khandual 

Acking v2 as well:

Acked-by: Catalin Marinas 


Re: [PATCH] mm: Merge pte_mkhuge() call into arch_make_huge_pte()

2022-02-03 Thread Catalin Marinas
On Wed, Feb 02, 2022 at 11:08:06AM +0530, Anshuman Khandual wrote:
> Each call into pte_mkhuge() is invariably followed by arch_make_huge_pte().
> Instead arch_make_huge_pte() can accommodate pte_mkhuge() at the beginning.
> This updates generic fallback stub for arch_make_huge_pte() and available
> platforms definitions. This makes huge pte creation much cleaner and easier
> to follow.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Michael Ellerman 
> Cc: Paul Mackerras 
> Cc: "David S. Miller" 
> Cc: Mike Kravetz 
> Cc: Andrew Morton 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: sparcli...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Anshuman Khandual 

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH v6 04/14] mm, hugetlbfs: Allow for "high" userspace addresses

2022-01-04 Thread Catalin Marinas
On Fri, Dec 17, 2021 at 10:27:28AM +, Christophe Leroy wrote:
> This is a complement of f6795053dac8 ("mm: mmap: Allow for "high"
> userspace addresses") for hugetlb.
> 
> This patch adds support for "high" userspace addresses that are
> optionally supported on the system and have to be requested via a hint
> mechanism ("high" addr parameter to mmap).
> 
> Architectures such as powerpc and x86 achieve this by making changes to
> their architectural versions of hugetlb_get_unmapped_area() function.
> However, arm64 uses the generic version of that function.
> 
> So take into account arch_get_mmap_base() and arch_get_mmap_end() in
> hugetlb_get_unmapped_area(). To allow that, move those two macros
> out of mm/mmap.c into include/linux/sched/mm.h
> 
> If these macros are not defined in architectural code then they default
> to (TASK_SIZE) and (base) so should not introduce any behavioural
> changes to architectures that do not define them.
> 
> For the time being, only ARM64 is affected by this change.
> 
> Signed-off-by: Christophe Leroy 
> Cc: Steve Capper 
> Cc: Will Deacon 
> Cc: Catalin Marinas 
> ---
>  fs/hugetlbfs/inode.c | 9 +
>  include/linux/sched/mm.h | 8 
>  mm/mmap.c| 8 
>  3 files changed, 13 insertions(+), 12 deletions(-)

This would be an ABI change but I'm fine with it. Basically with this
patch, getting hugetlbfs addresses above 48-bit requires an explicit hint
passed to mmap().

I wonder whether we should add a fixes tag (or at least the cc stable):

Fixes: f6795053dac8 ("mm: mmap: Allow for "high" userspace addresses")
Cc:  # 5.0.x

I think the original commit should have changed
hugetlb_get_unmapped_area() to have the same behaviour as
arch_get_unmapped_area(). Steve, any thoughts?

FWIW,

Reviewed-by: Catalin Marinas 


Re: [PATCH v6 03/14] mm: Add len and flags parameters to arch_get_mmap_end()

2022-01-04 Thread Catalin Marinas
On Fri, Dec 17, 2021 at 10:27:23AM +, Christophe Leroy wrote:
> Powerpc needs flags and len to make decision on arch_get_mmap_end().
> 
> So add them as parameters to arch_get_mmap_end().
> 
> Signed-off-by: Christophe Leroy 
> Cc: Steve Capper 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> ---
>  arch/arm64/include/asm/processor.h | 4 ++--
>  mm/mmap.c  | 6 +++---
>  2 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/processor.h 
> b/arch/arm64/include/asm/processor.h
> index 6f41b65f9962..88c696350ace 100644
> --- a/arch/arm64/include/asm/processor.h
> +++ b/arch/arm64/include/asm/processor.h
> @@ -91,8 +91,8 @@
>  #endif /* CONFIG_COMPAT */
>  
>  #ifndef CONFIG_ARM64_FORCE_52BIT
> -#define arch_get_mmap_end(addr)	((addr > DEFAULT_MAP_WINDOW) ? TASK_SIZE :\
> -						DEFAULT_MAP_WINDOW)
> +#define arch_get_mmap_end(addr, len, flags)	((addr > DEFAULT_MAP_WINDOW) ? TASK_SIZE :\
> +						DEFAULT_MAP_WINDOW)

Nitpick, move the "((addr > " on the following line. Either way,

Acked-by: Catalin Marinas 


Re: [PATCH v4 07/10] powerpc/mm: Use generic_hugetlb_get_unmapped_area()

2021-12-16 Thread Catalin Marinas
On Thu, Dec 16, 2021 at 05:13:47PM +, Christophe Leroy wrote:
> On 09/12/2021 at 11:02, Nicholas Piggin wrote:
> > Excerpts from Christophe Leroy's message of December 9, 2021 3:18 am:
> >> Use the generic version of arch_hugetlb_get_unmapped_area()
> >> which is now available at all time.
> >>
> >> Signed-off-by: Christophe Leroy 
> >> ---
> >>   arch/powerpc/include/asm/book3s/64/hugetlb.h |  4 --
> >>   arch/powerpc/mm/book3s64/radix_hugetlbpage.c | 55 
> >>   arch/powerpc/mm/hugetlbpage.c|  4 +-
> >>   3 files changed, 1 insertion(+), 62 deletions(-)
> >>
> >> diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h 
> >> b/arch/powerpc/include/asm/book3s/64/hugetlb.h
> >> index 12e150e615b7..b37a28f62cf6 100644
> >> --- a/arch/powerpc/include/asm/book3s/64/hugetlb.h
> >> +++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h
> >> @@ -8,10 +8,6 @@
> >>*/
> >>   void radix__flush_hugetlb_page(struct vm_area_struct *vma, unsigned long 
> >> vmaddr);
> >>   void radix__local_flush_hugetlb_page(struct vm_area_struct *vma, 
> >> unsigned long vmaddr);
> >> -extern unsigned long
> >> -radix__hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
> >> -  unsigned long len, unsigned long pgoff,
> >> -  unsigned long flags);
> >>   
> >>   extern void radix__huge_ptep_modify_prot_commit(struct vm_area_struct 
> >> *vma,
> >>unsigned long addr, pte_t *ptep,
> >> diff --git a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c 
> >> b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
> >> index 23d3e08911d3..d2fb776febb4 100644
> >> --- a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
> >> +++ b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
> >> @@ -41,61 +41,6 @@ void radix__flush_hugetlb_tlb_range(struct 
> >> vm_area_struct *vma, unsigned long st
> >>radix__flush_tlb_range_psize(vma->vm_mm, start, end, psize);
> >>   }
> >>   
> >> -/*
> >> - * A vairant of hugetlb_get_unmapped_area doing topdown search
> >> - * FIXME!! should we do as x86 does or non hugetlb area does ?
> >> - * ie, use topdown or not based on mmap_is_legacy check ?
> >> - */
> >> -unsigned long
> >> -radix__hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
> >> -  unsigned long len, unsigned long pgoff,
> >> -  unsigned long flags)
> >> -{
> >> -  struct mm_struct *mm = current->mm;
> >> -  struct vm_area_struct *vma;
> >> -  struct hstate *h = hstate_file(file);
> >> -  int fixed = (flags & MAP_FIXED);
> >> -  unsigned long high_limit;
> >> -  struct vm_unmapped_area_info info;
> >> -
> >> -  high_limit = DEFAULT_MAP_WINDOW;
> >> -  if (addr >= high_limit || (fixed && (addr + len > high_limit)))
> >> -  high_limit = TASK_SIZE;
> > 
> > I wonder if generic hugetlb_get_unmapped_area needs to have the
> > arch_get_mmap_end() added.
> > 
> > arm64 has arch_get_mmap_end() and !HAVE_ARCH_HUGETLB_UNMAPPED_AREA so
> > it looks like it has broken large address hint logic for hugetlbfs
> > mappings? x86-64 defines their own and does the same hinting for
> > normal and hugetlbfs mmap.
> > 
> > If we had that and defined arch_get_mmap_end(), then this patch should
> > work.
> > 
> 
> As far as I can see, hugetlb_get_unmapped_area() variants used to be 
> very similar to get_unmapped_area() until commit 1be7107fbe18 ("mm: 
> larger stack guard gap, between vmas") and commit f6795053dac8 ("mm: 
> mmap: Allow for "high" userspace addresses")
> 
> I see no reason why those changes couldn't apply to 
> hugetlb_get_unmapped_area() as well.
> 
> Need to know what ARM64 think about it thought. Will, Catalin, any opinion ?

I think we should have fixed hugetlb_get_unmapped_area() as well when we
added support for 52-bit VA. The reason for commit f6795053dac8 was to
prevent normal mmap() from returning addresses above 48-bit by default
as some user-space had hard assumptions about this.

It's a slight ABI change if you do this for hugetlb_get_unmapped_area()
but I doubt anyone would notice. It's more likely that the current
behaviour would cause issues, so I'd rather have them consistent.
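
To make this concrete, a minimal sketch (untested) of the generic
hugetlb_get_unmapped_area() bottom-up path if it clamped the search limit
the same way arch_get_unmapped_area() does, i.e. via arch_get_mmap_end()
instead of a bare TASK_SIZE:

	struct hstate *h = hstate_file(file);
	struct vm_unmapped_area_info info;

	info.flags = 0;
	info.length = len;
	info.low_limit = current->mm->mmap_base;
	/* DEFAULT_MAP_WINDOW unless a high address hint was passed in */
	info.high_limit = arch_get_mmap_end(addr);
	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
	info.align_offset = 0;
	return vm_unmapped_area(&info);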

-- 
Catalin


Re: [PATCH RFC 1/4] mm: percpu: Generalize percpu related config

2021-12-03 Thread Catalin Marinas
On Sun, Nov 21, 2021 at 05:35:54PM +0800, Kefeng Wang wrote:
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index c4207cf9bb17..4ff73299f8a9 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -1135,6 +1135,10 @@ config NUMA
>   select GENERIC_ARCH_NUMA
>   select ACPI_NUMA if ACPI
>   select OF_NUMA
> + select HAVE_SETUP_PER_CPU_AREA
> + select NEED_PER_CPU_EMBED_FIRST_CHUNK
> + select NEED_PER_CPU_PAGE_FIRST_CHUNK
> + select USE_PERCPU_NUMA_NODE_ID
>   help
> Enable NUMA (Non-Uniform Memory Access) support.
>  
> @@ -1151,22 +1155,6 @@ config NODES_SHIFT
> Specify the maximum number of NUMA Nodes available on the target
> system.  Increases memory reserved to accommodate various tables.
>  
> -config USE_PERCPU_NUMA_NODE_ID
> - def_bool y
> - depends on NUMA
> -
> -config HAVE_SETUP_PER_CPU_AREA
> - def_bool y
> - depends on NUMA
> -
> -config NEED_PER_CPU_EMBED_FIRST_CHUNK
> - def_bool y
> - depends on NUMA
> -
> -config NEED_PER_CPU_PAGE_FIRST_CHUNK
> - def_bool y
> - depends on NUMA
> -
>  source "kernel/Kconfig.hz"
>  
>  config ARCH_SPARSEMEM_ENABLE

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH v2 11/45] arm64: Use do_kernel_power_off()

2021-11-01 Thread Catalin Marinas
On Thu, Oct 28, 2021 at 12:16:41AM +0300, Dmitry Osipenko wrote:
> Kernel now supports chained power-off handlers. Use do_kernel_power_off()
> that invokes chained power-off handlers. It also invokes legacy
> pm_power_off() for now, which will be removed once all drivers will
> be converted to the new power-off API.
> 
> Signed-off-by: Dmitry Osipenko 

Acked-by: Catalin Marinas 
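
FWIW, on the arm64 side this amounts to roughly the following (sketch of
the resulting code, not the exact patch):

void machine_power_off(void)
{
	local_irq_disable();
	smp_send_stop();
	/* run chained power-off handlers, then the legacy pm_power_off() */
	do_kernel_power_off();
}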


Re: [RFC PATCH 5/8] sched: move CPU field back into thread_info if THREAD_INFO_IN_TASK=y

2021-09-21 Thread Catalin Marinas
On Tue, Sep 14, 2021 at 02:10:33PM +0200, Ard Biesheuvel wrote:
> THREAD_INFO_IN_TASK moved the CPU field out of thread_info, but this
> causes some issues on architectures that define raw_smp_processor_id()
> in terms of this field, due to the fact that #include'ing linux/sched.h
> to get at struct task_struct is problematic in terms of circular
> dependencies.
> 
> Given that thread_info and task_struct are the same data structure
> anyway when THREAD_INFO_IN_TASK=y, let's move it back so that having
> access to the type definition of struct thread_info is sufficient to
> reference the CPU number of the current task.
> 
> Signed-off-by: Ard Biesheuvel 

Acked-by: Catalin Marinas 
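
For reference, this is the kind of definition it makes possible without
having to pull linux/sched.h into low-level headers (sketch; whether and
when arm64 switches its own raw_smp_processor_id() over is a separate
question):

/* e.g. in an arch asm/smp.h, needing only the thread_info definition */
#define raw_smp_processor_id()	(current_thread_info()->cpu)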


Re: [RFC PATCH 1/8] arm64: add CPU field to struct thread_info

2021-09-16 Thread Catalin Marinas
On Tue, Sep 14, 2021 at 02:10:29PM +0200, Ard Biesheuvel wrote:
> The CPU field will be moved back into thread_info even when
> THREAD_INFO_IN_TASK is enabled, so add it back to arm64's definition of
> struct thread_info.
> 
> Signed-off-by: Ard Biesheuvel 

Acked-by: Catalin Marinas 


Re: [PATCH 1/3] arch: Export machine_restart() instances so they can be called from modules

2021-08-05 Thread Catalin Marinas
On Thu, Aug 05, 2021 at 08:50:30AM +0100, Lee Jones wrote:
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index b4bb67f17a2ca..cf89ce91d7145 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -212,6 +212,7 @@ void machine_restart(char *cmd)
>   printk("Reboot failed -- System halted\n");
>   while (1);
>  }
> +EXPORT_SYMBOL(machine_restart);

Should we make this EXPORT_SYMBOL_GPL? I suppose it's not for general
use by out of tree drivers and it matches the other pm_power_off symbol
we export in this file.
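
i.e. something like:

EXPORT_SYMBOL_GPL(machine_restart);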

Either way:

Acked-by: Catalin Marinas 


Re: [PATCH v2] arch: vdso: remove if-conditionals of $(c-gettimeofday-y)

2021-08-03 Thread Catalin Marinas
On Sat, Jul 31, 2021 at 03:00:20PM +0900, Masahiro Yamada wrote:
> arm, arm64, csky, mips, powerpc always select GENERIC_GETTIMEOFDAY,
> hence $(gettimeofday-y) never becomes empty.
> 
> riscv conditionally selects GENERIC_GETTIMEOFDAY when MMU=y && 64BIT=y,
> but arch/riscv/kernel/vdso/vgettimeofday.o is built only under that
> condition. So, you can always define CFLAGS_vgettimeofday.o
> 
> Remove all the meaningless conditionals.
> 
> Signed-off-by: Masahiro Yamada 

Acked-by: Catalin Marinas 


Re: [PATCH 2/3] trace: refactor TRACE_IRQFLAGS_SUPPORT in Kconfig

2021-08-02 Thread Catalin Marinas
On Sat, Jul 31, 2021 at 02:22:32PM +0900, Masahiro Yamada wrote:
> Make architectures select TRACE_IRQFLAGS_SUPPORT instead of
> having many defines.
> 
> Signed-off-by: Masahiro Yamada 

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH v1 04/12] mm/memory_hotplug: remove nid parameter from arch_remove_memory()

2021-06-08 Thread Catalin Marinas
On Mon, Jun 07, 2021 at 09:54:22PM +0200, David Hildenbrand wrote:
> The parameter is unused, let's remove it.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Heiko Carstens 
> Cc: Vasily Gorbik 
> Cc: Christian Borntraeger 
> Cc: Yoshinori Sato 
> Cc: Rich Felker 
> Cc: Dave Hansen 
> Cc: Andy Lutomirski 
> Cc: Peter Zijlstra 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: x...@kernel.org
> Cc: "H. Peter Anvin" 
> Cc: Andrew Morton 
> Cc: Anshuman Khandual 
> Cc: Ard Biesheuvel 
> Cc: Mike Rapoport 
> Cc: Nicholas Piggin 
> Cc: Pavel Tatashin 
> Cc: Baoquan He 
> Cc: Laurent Dufour 
> Cc: Sergei Trofimovich 
> Cc: Kefeng Wang 
> Cc: Michel Lespinasse 
> Cc: Christophe Leroy 
> Cc: "Aneesh Kumar K.V" 
> Cc: Thiago Jung Bauermann 
> Cc: Joe Perches 
> Cc: Pierre Morel 
> Cc: Jia He 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-i...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Signed-off-by: David Hildenbrand 
> ---
>  arch/arm64/mm/mmu.c| 3 +--

Acked-by: Catalin Marinas 
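
FWIW, the resulting prototype would then be roughly (sketch):

void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap);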


Re: [PATCH] mm: generalize ZONE_[DMA|DMA32]

2021-05-27 Thread Catalin Marinas
On Thu, May 27, 2021 at 10:30:47PM +0800, Kefeng Wang wrote:
> ZONE_[DMA|DMA32] configs have duplicate definitions on platforms
> that subscribe them. Instead, just make them generic options which
> can be selected on applicable platforms.
> 
> Also only x86/arm64 architectures could enable both ZONE_DMA and
> ZONE_DMA32 if EXPERT, add ARCH_HAS_ZONE_DMA_SET to make dma zone
> configurable and visible on the two architectures.
> 
> Cc: Andrew Morton  
> Cc: Catalin Marinas  
> Cc: Will Deacon  
> Cc: Geert Uytterhoeven  
> Cc: Thomas Bogendoerfer  
> Cc: "David S. Miller"  
> Cc: Ingo Molnar  
> Cc: Borislav Petkov  
> Cc: Palmer Dabbelt 
> Cc: Richard Henderson  
> Cc: Russell King 
> Signed-off-by: Kefeng Wang 

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH 0/6] mm: some config cleanups

2021-03-19 Thread Catalin Marinas
On Tue, Mar 09, 2021 at 02:03:04PM +0530, Anshuman Khandual wrote:
> This series contains config cleanup patches which reduces code duplication
> across platforms and also improves maintainability. There is no functional
> change intended with this series. This has been boot tested on arm64 but
> only build tested on some other platforms.
> 
> This applies on 5.12-rc2
> 
> Cc: x...@kernel.org
> Cc: linux-i...@vger.kernel.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux-snps-...@lists.infradead.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-m...@vger.kernel.org
> Cc: linux-par...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-ri...@lists.infradead.org
> Cc: linux...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> 
> Anshuman Khandual (6):
>   mm: Generalize ARCH_HAS_CACHE_LINE_SIZE
>   mm: Generalize SYS_SUPPORTS_HUGETLBFS (rename as ARCH_SUPPORTS_HUGETLBFS)
>   mm: Generalize ARCH_ENABLE_MEMORY_[HOTPLUG|HOTREMOVE]
>   mm: Drop redundant ARCH_ENABLE_[HUGEPAGE|THP]_MIGRATION
>   mm: Drop redundant ARCH_ENABLE_SPLIT_PMD_PTLOCK
>   mm: Drop redundant HAVE_ARCH_TRANSPARENT_HUGEPAGE
> 
>  arch/arc/Kconfig   |  9 ++--
>  arch/arm/Kconfig   | 10 ++---
>  arch/arm64/Kconfig | 30 ++

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH V2] mm/memtest: Add ARCH_USE_MEMTEST

2021-03-19 Thread Catalin Marinas
On Mon, Mar 01, 2021 at 10:02:06AM +0530, Anshuman Khandual wrote:
> early_memtest() does not get called from all architectures. Hence enabling
> CONFIG_MEMTEST and providing a valid memtest=[1..N] kernel command line
> option might not trigger the memory pattern tests as would be expected in
> normal circumstances. This situation is misleading.
> 
> The change here prevents the above mentioned problem after introducing a
> new config option ARCH_USE_MEMTEST that should be subscribed on platforms
> that call early_memtest(), in order to enable the config CONFIG_MEMTEST.
> Conversely CONFIG_MEMTEST cannot be enabled on platforms where it would
> not be tested anyway.
> 
> Cc: Russell King 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Thomas Bogendoerfer 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Chris Zankel 
> Cc: Max Filippov 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-m...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-xte...@linux-xtensa.org
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> Reviewed-by: Max Filippov 
> Signed-off-by: Anshuman Khandual 

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH 1/5] ARM: configs: drop unused BACKLIGHT_GENERIC option

2020-12-01 Thread Catalin Marinas
On Mon, Nov 30, 2020 at 07:50:25PM +, ZHIZHIKIN Andrey wrote:
> From Krzysztof Kozlowski :
> > On Mon, Nov 30, 2020 at 03:21:33PM +, Andrey Zhizhikin wrote:
> > > Commit 7ecdea4a0226 ("backlight: generic_bl: Remove this driver as it is
> > > unused") removed geenric_bl driver from the tree, together with
> > > corresponding config option.
> > >
> > > Remove BACKLIGHT_GENERIC config item from all ARM configurations.
> > >
> > > Fixes: 7ecdea4a0226 ("backlight: generic_bl: Remove this driver as it
> > > is unused")
> > > Cc: Sam Ravnborg 
> > > Signed-off-by: Andrey Zhizhikin
> > > 
> > > ---
> > >  arch/arm/configs/at91_dt_defconfig| 1 -
> > >  arch/arm/configs/cm_x300_defconfig| 1 -
> > >  arch/arm/configs/colibri_pxa300_defconfig | 1 -
> > >  arch/arm/configs/jornada720_defconfig | 1 -
> > >  arch/arm/configs/magician_defconfig   | 1 -
> > >  arch/arm/configs/mini2440_defconfig   | 1 -
> > >  arch/arm/configs/omap2plus_defconfig  | 1 -
> > >  arch/arm/configs/pxa3xx_defconfig | 1 -
> > >  arch/arm/configs/qcom_defconfig   | 1 -
> > >  arch/arm/configs/sama5_defconfig  | 1 -
> > >  arch/arm/configs/sunxi_defconfig  | 1 -
> > >  arch/arm/configs/tegra_defconfig  | 1 -
> > >  arch/arm/configs/u8500_defconfig  | 1 -
> > >  13 files changed, 13 deletions(-)
> > 
> > You need to send it to arm-soc maintainers, otherwise no one might feel
> > responsible enough to pick it up.
> 
> Good point, thanks a lot!
> 
> I was not aware of the fact that there is a separate ML that should
> receive patches targeted ARM SOCs. Can you (or anyone else) please
> share it, so I can re-send it there as well?

It's not a mailing list as such (with archives etc.), just an alias to
the arm-soc maintainers: a...@kernel.org.

> > Reviewed-by: Krzysztof Kozlowski 
> > 
> > +CC Arnd and Olof,
> > 
> > Dear Arnd and Olof,
> > 
> > Maybe it is worth to add arm-soc entry to the MAINTAINERS file?
> > Otherwise how one could get your email address? Not mentioning the
> > secret-soc address. :)

I tried to convince them before, but it didn't work. I guess they don't like
to be spammed ;). Or rather, SoC-specific patches, even to defconfig,
should go through the specific SoC maintainers. However, there are
occasional defconfig patches which are more generic or affecting
multiple SoCs. I just ignore them as the arm64 defconfig is usually
handled by the arm-soc folk (when I need a defconfig change, I go for
arch/arm64/Kconfig directly ;)).

Anyway, I still think that we should add a MAINTAINERS entry for
arch/arm64/configs/defconfig and arch/arm64/Kconfig.platforms.

-- 
Catalin


Re: [PATCH v8 05/12] mm: HUGE_VMAP arch support cleanup

2020-12-01 Thread Catalin Marinas
On Sun, Nov 29, 2020 at 01:25:52AM +1000, Nicholas Piggin wrote:
> This changes the awkward approach where architectures provide init
> functions to determine which levels they can provide large mappings for,
> to one where the arch is queried for each call.
> 
> This removes code and indirection, and allows constant-folding of dead
> code for unsupported levels.
> 
> This also adds a prot argument to the arch query. This is unused
> currently but could help with some architectures (e.g., some powerpc
> processors can't map uncacheable memory with large pages).
> 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: x...@kernel.org
> Cc: "H. Peter Anvin" 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/arm64/include/asm/vmalloc.h |  8 +++
>  arch/arm64/mm/mmu.c  | 10 +--

For arm64:

Acked-by: Catalin Marinas 
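
For reference, the per-call query on arm64 ends up with roughly this shape
(sketch based on the old arch_ioremap_p*d_supported() logic; details in the
series may differ):

static inline bool arch_vmap_pud_supported(pgprot_t prot)
{
	/* Only the 4K granule supports level 1 (PUD) block mappings */
	return IS_ENABLED(CONFIG_ARM64_4K_PAGES) &&
	       !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
}

static inline bool arch_vmap_pmd_supported(pgprot_t prot)
{
	/* See arch_vmap_pud_supported() */
	return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
}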


Re: [PATCH 1/2] mm/mprotect: Call arch_validate_prot under mmap_lock and with length

2020-10-15 Thread Catalin Marinas
On Wed, Oct 14, 2020 at 03:21:16PM -0600, Khalid Aziz wrote:
> On 10/13/20 3:16 AM, Catalin Marinas wrote:
> > On Mon, Oct 12, 2020 at 01:14:50PM -0600, Khalid Aziz wrote:
> >> On 10/12/20 11:22 AM, Catalin Marinas wrote:
> >>> On Mon, Oct 12, 2020 at 11:03:33AM -0600, Khalid Aziz wrote:
> >>>> On 10/10/20 5:09 AM, Catalin Marinas wrote:
> >>>>> I still think sparc should avoid walking the vmas in
> >>>>> arch_validate_prot(). The core code already has the vmas, though not
> >>>>> when calling arch_validate_prot(). That's one of the reasons I added
> >>>>> arch_validate_flags() with the MTE patches. For sparc, this could be
> >>>>> (untested, just copied the arch_validate_prot() code):
> >>>>
> >>>> I am little uncomfortable with the idea of validating protection bits
> >>>> inside the VMA walk loop in do_mprotect_pkey(). When ADI is being
> >>>> enabled across multiple VMAs and arch_validate_flags() fails on a VMA
> >>>> later, do_mprotect_pkey() will bail out with error leaving ADI enabled
> >>>> on earlier VMAs. This will apply to protection bits other than ADI as
> >>>> well of course. This becomes a partial failure of mprotect() call. I
> >>>> think it should be all or nothing with mprotect() - when one calls
> >>>> mprotect() from userspace, either the entire address range passed in
> >>>> gets its protection bits updated or none of it does. That requires
> >>>> validating protection bits upfront or undoing what earlier iterations of
> >>>> VMA walk loop might have done.
> >>>
> >>> I thought the same initially but mprotect() already does this with the
> >>> VM_MAY* flag checking. If you ask it for an mprotect() that crosses
> >>> multiple vmas and one of them fails, it doesn't roll back the changes to
> >>> the prior ones. I considered that a similar approach is fine for MTE
> >>> (it's most likely a user error).
> >>
> >> You are right about the current behavior with VM_MAY* flags, but that is
> >> not the right behavior. Adding more cases to this just perpetuates
> >> incorrect behavior. It is not easy to roll back changes after VMAs have
> >> potentially been split/merged which is probably why the current code
> >> simply throws in the towel and returns with partially modified address
> >> space. It is lot easier to do all the checks upfront and then proceed or
> >> not proceed with modifying VMAs. One approach might be to call
> >> arch_validate_flags() in a loop before modifying VMAs and walk all VMAs
> >> with a read lock held. Current code also bails out with ENOMEM if it
> >> finds a hole in the address range and leaves any modifications already
> >> made in place. This is another case where a hole could have been
> >> detected earlier.
> > 
> > This should be ideal indeed though with the risk of breaking the current
> > ABI (FWIW, FreeBSD seems to do a first pass to check for violations:
> > https://github.com/freebsd/freebsd/blob/master/sys/vm/vm_map.c#L2630).
> 
> I am not sure I understand where the ABI breakage would be. Are we aware
> of apps that intentionally modify address space partially using the
> current code?

I hope there aren't any but you never know until you make the change and
someone complains. Arguably, such user code is already broken since
mprotect() doesn't even tell where the failure occurred.

> What FreeBSD does seems like a reasonable thing to do. Anyway, first
> thing to do is to update sparc to use arch_validate_flags() and update
> sparc_validate_prot() to not peek into vma without lock.

If you go for arch_validate_flags(), I think sparc_validate_prot()
doesn't need the vma at all.

BTW, on the ADI topic, I think you have a race in do_swap_page() since
set_pte_at() is called before arch_do_swap_page(). So a thread in the
same process would see the new mapping but the tags have not been
updated yet. Unless sparc relies on the new user pte to be set, I think
you can just swap the two calls.
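
i.e. in do_swap_page(), roughly (sketch; exact arguments as in mm/memory.c
at the time):

	/* update the arch state (ADI tags) before the new pte is visible */
	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);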

-- 
Catalin


Re: [PATCH 1/2] mm/mprotect: Call arch_validate_prot under mmap_lock and with length

2020-10-13 Thread Catalin Marinas
On Mon, Oct 12, 2020 at 01:14:50PM -0600, Khalid Aziz wrote:
> On 10/12/20 11:22 AM, Catalin Marinas wrote:
> > On Mon, Oct 12, 2020 at 11:03:33AM -0600, Khalid Aziz wrote:
> >> On 10/10/20 5:09 AM, Catalin Marinas wrote:
> >>> On Wed, Oct 07, 2020 at 02:14:09PM -0600, Khalid Aziz wrote:
> >>>> On 10/7/20 1:39 AM, Jann Horn wrote:
> >>>>> arch_validate_prot() is a hook that can validate whether a given set of
> >>>>> protection flags is valid in an mprotect() operation. It is given the 
> >>>>> set
> >>>>> of protection flags and the address being modified.
> >>>>>
> >>>>> However, the address being modified can currently not actually be used 
> >>>>> in
> >>>>> a meaningful way because:
> >>>>>
> >>>>> 1. Only the address is given, but not the length, and the operation can
> >>>>>span multiple VMAs. Therefore, the callee can't actually tell which
> >>>>>virtual address range, or which VMAs, are being targeted.
> >>>>> 2. The mmap_lock is not held, meaning that if the callee were to check
> >>>>>the VMA at @addr, that VMA would be unrelated to the one the
> >>>>>operation is performed on.
> >>>>>
> >>>>> Currently, custom arch_validate_prot() handlers are defined by
> >>>>> arm64, powerpc and sparc.
> >>>>> arm64 and powerpc don't care about the address range, they just check 
> >>>>> the
> >>>>> flags against CPU support masks.
> >>>>> sparc's arch_validate_prot() attempts to look at the VMA, but doesn't 
> >>>>> take
> >>>>> the mmap_lock.
> >>>>>
> >>>>> Change the function signature to also take a length, and move the
> >>>>> arch_validate_prot() call in mm/mprotect.c down into the locked region.
> >>> [...]
> >>>> As Chris pointed out, the call to arch_validate_prot() from do_mmap2()
> >>>> is made without holding mmap_lock. Lock is not acquired until
> >>>> vm_mmap_pgoff(). This variance is uncomfortable but I am more
> >>>> uncomfortable forcing all implementations of validate_prot to require
> >>>> mmap_lock be held when non-sparc implementations do not have such need
> >>>> yet. Since do_mmap2() is in powerpc specific code, for now this patch
> >>>> solves a current problem.
> >>>
> >>> I still think sparc should avoid walking the vmas in
> >>> arch_validate_prot(). The core code already has the vmas, though not
> >>> when calling arch_validate_prot(). That's one of the reasons I added
> >>> arch_validate_flags() with the MTE patches. For sparc, this could be
> >>> (untested, just copied the arch_validate_prot() code):
> >>
> >> I am little uncomfortable with the idea of validating protection bits
> >> inside the VMA walk loop in do_mprotect_pkey(). When ADI is being
> >> enabled across multiple VMAs and arch_validate_flags() fails on a VMA
> >> later, do_mprotect_pkey() will bail out with error leaving ADI enabled
> >> on earlier VMAs. This will apply to protection bits other than ADI as
> >> well of course. This becomes a partial failure of mprotect() call. I
> >> think it should be all or nothing with mprotect() - when one calls
> >> mprotect() from userspace, either the entire address range passed in
> >> gets its protection bits updated or none of it does. That requires
> >> validating protection bits upfront or undoing what earlier iterations of
> >> VMA walk loop might have done.
> > 
> > I thought the same initially but mprotect() already does this with the
> > VM_MAY* flag checking. If you ask it for an mprotect() that crosses
> > multiple vmas and one of them fails, it doesn't roll back the changes to
> > the prior ones. I considered that a similar approach is fine for MTE
> > (it's most likely a user error).
> 
> You are right about the current behavior with VM_MAY* flags, but that is
> not the right behavior. Adding more cases to this just perpetuates
> incorrect behavior. It is not easy to roll back changes after VMAs have
> potentially been split/merged which is probably why the current code
> simply throws in the towel and returns with partially modified address
> space. It is lot easier to do all the checks upfront and then proceed or
> not proceed with modifying VMAs. One approach might be to call
> arch_validate_flags() in a loop before modifying VMAs and walk all VMAs
> with a read lock held. Current code also bails out with ENOMEM if it
> finds a hole in the address range and leaves any modifications already
> made in place. This is another case where a hole could have been
> detected earlier.

This should be ideal indeed though with the risk of breaking the current
ABI (FWIW, FreeBSD seems to do a first pass to check for violations:
https://github.com/freebsd/freebsd/blob/master/sys/vm/vm_map.c#L2630).

However, I'm not sure it's worth the hassle. Do we expect the user to
call mprotect() across multiple mixed type mappings while relying on no
change if an error is returned? We should probably at least document the
current behaviour in the mprotect man page.
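
If we did want all-or-nothing behaviour, a first validation pass over the
range before any vma is modified could look roughly like this (untested
sketch; vma iteration and the newflags computation simplified):

	struct vm_area_struct *vma;
	unsigned long newflags;

	for (vma = find_vma(current->mm, start);
	     vma && vma->vm_start < end; vma = vma->vm_next) {
		newflags = calc_vm_prot_bits(prot, pkey) |
			   (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
		if (!arch_validate_flags(newflags))
			return -EINVAL;	/* nothing has been modified yet */
	}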

-- 
Catalin


Re: [PATCH 1/2] mm/mprotect: Call arch_validate_prot under mmap_lock and with length

2020-10-12 Thread Catalin Marinas
On Mon, Oct 12, 2020 at 11:03:33AM -0600, Khalid Aziz wrote:
> On 10/10/20 5:09 AM, Catalin Marinas wrote:
> > On Wed, Oct 07, 2020 at 02:14:09PM -0600, Khalid Aziz wrote:
> >> On 10/7/20 1:39 AM, Jann Horn wrote:
> >>> arch_validate_prot() is a hook that can validate whether a given set of
> >>> protection flags is valid in an mprotect() operation. It is given the set
> >>> of protection flags and the address being modified.
> >>>
> >>> However, the address being modified can currently not actually be used in
> >>> a meaningful way because:
> >>>
> >>> 1. Only the address is given, but not the length, and the operation can
> >>>span multiple VMAs. Therefore, the callee can't actually tell which
> >>>virtual address range, or which VMAs, are being targeted.
> >>> 2. The mmap_lock is not held, meaning that if the callee were to check
> >>>the VMA at @addr, that VMA would be unrelated to the one the
> >>>operation is performed on.
> >>>
> >>> Currently, custom arch_validate_prot() handlers are defined by
> >>> arm64, powerpc and sparc.
> >>> arm64 and powerpc don't care about the address range, they just check the
> >>> flags against CPU support masks.
> >>> sparc's arch_validate_prot() attempts to look at the VMA, but doesn't take
> >>> the mmap_lock.
> >>>
> >>> Change the function signature to also take a length, and move the
> >>> arch_validate_prot() call in mm/mprotect.c down into the locked region.
> > [...]
> >> As Chris pointed out, the call to arch_validate_prot() from do_mmap2()
> >> is made without holding mmap_lock. Lock is not acquired until
> >> vm_mmap_pgoff(). This variance is uncomfortable but I am more
> >> uncomfortable forcing all implementations of validate_prot to require
> >> mmap_lock be held when non-sparc implementations do not have such need
> >> yet. Since do_mmap2() is in powerpc specific code, for now this patch
> >> solves a current problem.
> > 
> > I still think sparc should avoid walking the vmas in
> > arch_validate_prot(). The core code already has the vmas, though not
> > when calling arch_validate_prot(). That's one of the reasons I added
> > arch_validate_flags() with the MTE patches. For sparc, this could be
> > (untested, just copied the arch_validate_prot() code):
> 
> I am little uncomfortable with the idea of validating protection bits
> inside the VMA walk loop in do_mprotect_pkey(). When ADI is being
> enabled across multiple VMAs and arch_validate_flags() fails on a VMA
> later, do_mprotect_pkey() will bail out with error leaving ADI enabled
> on earlier VMAs. This will apply to protection bits other than ADI as
> well of course. This becomes a partial failure of mprotect() call. I
> think it should be all or nothing with mprotect() - when one calls
> mprotect() from userspace, either the entire address range passed in
> gets its protection bits updated or none of it does. That requires
> validating protection bits upfront or undoing what earlier iterations of
> VMA walk loop might have done.

I thought the same initially but mprotect() already does this with the
VM_MAY* flag checking. If you ask it for an mprotect() that crosses
multiple vmas and one of them fails, it doesn't roll back the changes to
the prior ones. I considered that a similar approach is fine for MTE
(it's most likely a user error).

-- 
Catalin


Re: [PATCH 1/2] mm/mprotect: Call arch_validate_prot under mmap_lock and with length

2020-10-10 Thread Catalin Marinas
Hi Khalid,

On Wed, Oct 07, 2020 at 02:14:09PM -0600, Khalid Aziz wrote:
> On 10/7/20 1:39 AM, Jann Horn wrote:
> > arch_validate_prot() is a hook that can validate whether a given set of
> > protection flags is valid in an mprotect() operation. It is given the set
> > of protection flags and the address being modified.
> > 
> > However, the address being modified can currently not actually be used in
> > a meaningful way because:
> > 
> > 1. Only the address is given, but not the length, and the operation can
> >span multiple VMAs. Therefore, the callee can't actually tell which
> >virtual address range, or which VMAs, are being targeted.
> > 2. The mmap_lock is not held, meaning that if the callee were to check
> >the VMA at @addr, that VMA would be unrelated to the one the
> >operation is performed on.
> > 
> > Currently, custom arch_validate_prot() handlers are defined by
> > arm64, powerpc and sparc.
> > arm64 and powerpc don't care about the address range, they just check the
> > flags against CPU support masks.
> > sparc's arch_validate_prot() attempts to look at the VMA, but doesn't take
> > the mmap_lock.
> > 
> > Change the function signature to also take a length, and move the
> > arch_validate_prot() call in mm/mprotect.c down into the locked region.
[...]
> As Chris pointed out, the call to arch_validate_prot() from do_mmap2()
> is made without holding mmap_lock. Lock is not acquired until
> vm_mmap_pgoff(). This variance is uncomfortable but I am more
> uncomfortable forcing all implementations of validate_prot to require
> mmap_lock be held when non-sparc implementations do not have such need
> yet. Since do_mmap2() is in powerpc specific code, for now this patch
> solves a current problem.

I still think sparc should avoid walking the vmas in
arch_validate_prot(). The core code already has the vmas, though not
when calling arch_validate_prot(). That's one of the reasons I added
arch_validate_flags() with the MTE patches. For sparc, this could be
(untested, just copied the arch_validate_prot() code):

static inline bool arch_validate_flags(unsigned long vm_flags)
{
	if (!(vm_flags & VM_SPARC_ADI))
		return true;

	if (!adi_capable())
		return false;

	/* ADI can not be enabled on PFN mapped pages */
	if (vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
		return false;

	/*
	 * Mergeable pages can become unmergeable if ADI is enabled on
	 * them even if they have identical data on them. This can be
	 * because ADI enabled pages with identical data may still not
	 * have identical ADI tags on them. Disallow ADI on mergeable
	 * pages.
	 */
	if (vm_flags & VM_MERGEABLE)
		return false;

	return true;
}

> That leaves open the question of should
> generic mmap call arch_validate_prot and return EINVAL for invalid
> combination of protection bits, but that is better addressed in a
> separate patch.

The above would cover mmap() as well.

The current sparc_validate_prot() relies on finding the vma for the
corresponding address. However, if you call this early in the mmap()
path, there's no such vma. It is only created later in mmap_region()
which no longer has the original PROT_* flags (all converted to VM_*
flags).

Calling arch_validate_flags() on mmap() has a small side-effect on the
user ABI: if the CPU is not adi_capable(), PROT_ADI is currently ignored
on mmap() but rejected by sparc_validate_prot(). Powerpc does this already
and I think it should be fine for arm64 (it needs checking
though as we have another flag, PROT_BTI, hopefully dynamic loaders
don't pass this flag unconditionally).

However, as I said above, it doesn't solve the mmap() PROT_ADI checking
for sparc since there's no vma yet. I'd strongly recommend the
arch_validate_flags() approach and reverting the "start" parameter added
to arch_validate_prot() if you go for the flags route.

Thanks.

-- 
Catalin

