Re: [PATCH v13 03/18] arm64: hyp-stub: Move el1_sync into the vectors

2021-04-08 Thread Pavel Tatashin
> > Thank you for noticing this. Not sure how this mismerge happened. I
> > have added the missing case, and VHE is now initialized correctly during
> > normal boot, kexec reboot, and kdump reboot:
> > [   14.698175] kvm [1]: VHE mode initialized successfully
> >
> > I will respin the series and send version 14 soon.
>
> Please give people a chance to review this lot first. This isn't code
> that is easy to digest, and immediate re-spinning does more harm than
> good (this isn't targeting 5.13, I would assume).
>

There are people testing this series, which is why I wanted to respin.
But I will wait for review comments before sending the next version. In
the meantime, I will send a fixed version of this patch as a reply to
this thread instead.

Thanks,
Pasha


Re: [PATCH v13 03/18] arm64: hyp-stub: Move el1_sync into the vectors

2021-04-08 Thread Pavel Tatashin
On Thu, Apr 8, 2021 at 6:24 AM Marc Zyngier  wrote:
>
> On 2021-04-08 05:05, Pavel Tatashin wrote:
> > From: James Morse 
> >
> > The hyp-stub's el1_sync code doesn't do very much, so it can easily fit
> > in the vectors.
> >
> > With this, all of the hyp-stub's behaviour is contained in its vectors.
> > This lets kexec and hibernate copy the hyp-stub when they need its
> > behaviour, instead of re-implementing it.
> >
> > Signed-off-by: James Morse 
> >
> > [Fixed merging issues]
>
> That's a pretty odd fix IMO.
>
> >
> > Signed-off-by: Pavel Tatashin 
> > ---
> >  arch/arm64/kernel/hyp-stub.S | 59 ++--
> >  1 file changed, 29 insertions(+), 30 deletions(-)
> >
> > diff --git a/arch/arm64/kernel/hyp-stub.S
> > b/arch/arm64/kernel/hyp-stub.S
> > index ff329c5c074d..d1a73d0f74e0 100644
> > --- a/arch/arm64/kernel/hyp-stub.S
> > +++ b/arch/arm64/kernel/hyp-stub.S
> > @@ -21,6 +21,34 @@ SYM_CODE_START_LOCAL(\label)
> >   .align 7
> >   b   \label
> >  SYM_CODE_END(\label)
> > +.endm
> > +
> > +.macro hyp_stub_el1_sync
> > +SYM_CODE_START_LOCAL(hyp_stub_el1_sync)
> > + .align 7
> > + cmp x0, #HVC_SET_VECTORS
> > + b.ne2f
> > + msr vbar_el2, x1
> > + b   9f
> > +
> > +2:   cmp x0, #HVC_SOFT_RESTART
> > + b.ne3f
> > + mov x0, x2
> > + mov x2, x4
> > + mov x4, x1
> > + mov x1, x3
> > + br  x4  // no return
> > +
> > +3:   cmp x0, #HVC_RESET_VECTORS
> > + beq 9f  // Nothing to reset!
> > +
> > + /* Someone called kvm_call_hyp() against the hyp-stub... */
> > + mov_q   x0, HVC_STUB_ERR
> > + eret
> > +
> > +9:   mov x0, xzr
> > + eret
> > +SYM_CODE_END(hyp_stub_el1_sync)
>
> You said you tested this on a TX2. I guess you don't care whether
> it runs VHE or not...

Hi Marc,

Thank you for noticing this. Not sure how this mismerge happened. I
have added the missing case, and VHE is now initialized correctly during
normal boot, kexec reboot, and kdump reboot:
[   14.698175] kvm [1]: VHE mode initialized successfully

I will respin the series and send version 14 soon.

Thanks,
Pasha

>
>  M.
>
> >  .endm
> >
> >   .text
> > @@ -39,7 +67,7 @@ SYM_CODE_START(__hyp_stub_vectors)
> >   invalid_vector  hyp_stub_el2h_fiq_invalid   // FIQ EL2h
> >   invalid_vector  hyp_stub_el2h_error_invalid // Error EL2h
> >
> > - ventry  el1_sync// Synchronous 64-bit EL1
> > + hyp_stub_el1_sync   // Synchronous 64-bit 
> > EL1
> >   invalid_vector  hyp_stub_el1_irq_invalid// IRQ 64-bit EL1
> >   invalid_vector  hyp_stub_el1_fiq_invalid// FIQ 64-bit EL1
> >   invalid_vector  hyp_stub_el1_error_invalid  // Error 64-bit EL1
> > @@ -55,35 +83,6 @@ SYM_CODE_END(__hyp_stub_vectors)
> >  # Check the __hyp_stub_vectors didn't overflow
> >  .org . - (__hyp_stub_vectors_end - __hyp_stub_vectors) + SZ_2K
> >
> > -
> > -SYM_CODE_START_LOCAL(el1_sync)
> > - cmp x0, #HVC_SET_VECTORS
> > - b.ne1f
> > - msr vbar_el2, x1
> > - b   9f
> > -
> > -1:   cmp x0, #HVC_VHE_RESTART
> > - b.eqmutate_to_vhe
> > -
> > -2:   cmp x0, #HVC_SOFT_RESTART
> > - b.ne3f
> > - mov x0, x2
> > - mov x2, x4
> > - mov x4, x1
> > - mov x1, x3
> > - br  x4  // no return
> > -
> > -3:   cmp x0, #HVC_RESET_VECTORS
> > - beq 9f  // Nothing to reset!
> > -
> > - /* Someone called kvm_call_hyp() against the hyp-stub... */
> > - mov_q   x0, HVC_STUB_ERR
> > - eret
> > -
> > -9:   mov x0, xzr
> > - eret
> > -SYM_CODE_END(el1_sync)
> > -
> >  // nVHE? No way! Give me the real thing!
> >  SYM_CODE_START_LOCAL(mutate_to_vhe)
> >   // Sanity check: MMU *must* be off
>
> --
> Jazz is not dead. It just smells funny...


[PATCH v13 14/18] arm64: kexec: install a copy of the linear-map

2021-04-07 Thread Pavel Tatashin
To perform the kexec relocations with the MMU enabled, we need a copy
of the linear map.

Create one, and install it from the relocation code. This has to be done
from the assembly code as it will be idmapped with TTBR0. The kernel
runs in TTBR1, so it can't use the break-before-make sequence on the mapping
it is executing from.

This makes no difference yet, as the relocation code runs with the MMU
disabled.
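
The tail of this hunk is cut off by the archive; for orientation, here is a
minimal sketch of the C side, assuming it uses the existing
trans_pgd_create_copy() helper to populate the new ttbr1 and zero_page
fields (the function name below is made up for illustration):

#include <linux/kexec.h>
#include <asm/trans_pgd.h>

/*
 * Sketch only: build a copy of the linear map and record the physical
 * addresses that the idmapped relocation assembly will load into
 * TTBR1_EL1. This is an illustration, not the literal (truncated) hunk.
 */
static int kexec_init_linear_map_copy(struct kimage *kimage,
                                      struct trans_pgd_info *info)
{
        pgd_t *trans_pgd = NULL;
        int rc;

        rc = trans_pgd_create_copy(info, &trans_pgd, PAGE_OFFSET, PAGE_END);
        if (rc)
                return rc;

        kimage->arch.ttbr1 = __pa(trans_pgd);
        kimage->arch.zero_page = __pa(empty_zero_page);
        return 0;
}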

Co-developed-by: James Morse 
Signed-off-by: Pavel Tatashin 
---
 arch/arm64/include/asm/assembler.h  | 19 +++
 arch/arm64/include/asm/kexec.h  |  2 ++
 arch/arm64/kernel/asm-offsets.c |  2 ++
 arch/arm64/kernel/hibernate-asm.S   | 20 
 arch/arm64/kernel/machine_kexec.c   | 16 ++--
 arch/arm64/kernel/relocate_kernel.S |  3 +++
 6 files changed, 40 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/include/asm/assembler.h 
b/arch/arm64/include/asm/assembler.h
index 29061b76aab6..3ce8131ad660 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -425,6 +425,25 @@ USER(\label, icivau, \tmp2)// 
invalidate I line PoU
isb
.endm
 
+/*
+ * To prevent the possibility of old and new partial table walks being visible
+ * in the tlb, switch the ttbr to a zero page when we invalidate the old
+ * records. D4.7.1 'General TLB maintenance requirements' in ARM DDI 0487A.i
+ * Even switching to our copied tables will cause a changed output address at
+ * each stage of the walk.
+ */
+   .macro break_before_make_ttbr_switch zero_page, page_table, tmp, tmp2
+   phys_to_ttbr \tmp, \zero_page
+   msr ttbr1_el1, \tmp
+   isb
+   tlbivmalle1
+   dsb nsh
+   phys_to_ttbr \tmp, \page_table
+   offset_ttbr1 \tmp, \tmp2
+   msr ttbr1_el1, \tmp
+   isb
+   .endm
+
 /*
  * reset_pmuserenr_el0 - reset PMUSERENR_EL0 if PMUv3 present
  */
diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 305cf0840ed3..59ac166daf53 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -97,6 +97,8 @@ struct kimage_arch {
phys_addr_t dtb_mem;
phys_addr_t kern_reloc;
phys_addr_t el2_vectors;
+   phys_addr_t ttbr1;
+   phys_addr_t zero_page;
/* Core ELF header buffer */
void *elf_headers;
unsigned long elf_headers_mem;
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 2e3278df1fc3..609362b5aa76 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -158,6 +158,8 @@ int main(void)
 #ifdef CONFIG_KEXEC_CORE
   DEFINE(KIMAGE_ARCH_DTB_MEM,  offsetof(struct kimage, arch.dtb_mem));
   DEFINE(KIMAGE_ARCH_EL2_VECTORS,  offsetof(struct kimage, 
arch.el2_vectors));
+  DEFINE(KIMAGE_ARCH_ZERO_PAGE,offsetof(struct kimage, 
arch.zero_page));
+  DEFINE(KIMAGE_ARCH_TTBR1,offsetof(struct kimage, arch.ttbr1));
   DEFINE(KIMAGE_HEAD,  offsetof(struct kimage, head));
   DEFINE(KIMAGE_START, offsetof(struct kimage, start));
   BLANK();
diff --git a/arch/arm64/kernel/hibernate-asm.S 
b/arch/arm64/kernel/hibernate-asm.S
index 8ccca660034e..a31e621ba867 100644
--- a/arch/arm64/kernel/hibernate-asm.S
+++ b/arch/arm64/kernel/hibernate-asm.S
@@ -15,26 +15,6 @@
 #include 
 #include 
 
-/*
- * To prevent the possibility of old and new partial table walks being visible
- * in the tlb, switch the ttbr to a zero page when we invalidate the old
- * records. D4.7.1 'General TLB maintenance requirements' in ARM DDI 0487A.i
- * Even switching to our copied tables will cause a changed output address at
- * each stage of the walk.
- */
-.macro break_before_make_ttbr_switch zero_page, page_table, tmp, tmp2
-   phys_to_ttbr \tmp, \zero_page
-   msr ttbr1_el1, \tmp
-   isb
-   tlbivmalle1
-   dsb nsh
-   phys_to_ttbr \tmp, \page_table
-   offset_ttbr1 \tmp, \tmp2
-   msr ttbr1_el1, \tmp
-   isb
-.endm
-
-
 /*
  * Resume from hibernate
  *
diff --git a/arch/arm64/kernel/machine_kexec.c 
b/arch/arm64/kernel/machine_kexec.c
index f1451d807708..c875ef522e53 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -153,6 +153,8 @@ static void *kexec_page_alloc(void *arg)
 
 int machine_kexec_post_load(struct kimage *kimage)
 {
+   int rc;
+   pgd_t *trans_pgd;
void *reloc_code = page_to_virt(kimage->control_code_page);
long reloc_size;
struct trans_pgd_info info = {
@@ -169,12 +171,22 @@ int machine_kexec_post_load(struct kimage *kimage)
 
kimage->arch.el2_vectors = 0;
if (is_hyp_callable()) {
-   int rc = trans_pgd_copy_el2_vectors(,
-   >arch.el2_vectors);
+   rc = trans_pgd_copy_el

[PATCH v13 18/18] arm64/mm: remove useless trans_pgd_map_page()

2021-04-07 Thread Pavel Tatashin
From: Pingfan Liu 

The intent of trans_pgd_map_page() was to map a contiguous range of VA
memory to the memory that is getting relocated during kexec. However,
since we are now using the linear map instead of a contiguous range,
this function is not needed.

Signed-off-by: Pingfan Liu 
[Changed commit message]
Signed-off-by: Pavel Tatashin 
---
 arch/arm64/include/asm/trans_pgd.h |  5 +--
 arch/arm64/mm/trans_pgd.c  | 57 --
 2 files changed, 1 insertion(+), 61 deletions(-)

diff --git a/arch/arm64/include/asm/trans_pgd.h 
b/arch/arm64/include/asm/trans_pgd.h
index e0760e52d36d..234353df2f13 100644
--- a/arch/arm64/include/asm/trans_pgd.h
+++ b/arch/arm64/include/asm/trans_pgd.h
@@ -15,7 +15,7 @@
 /*
  * trans_alloc_page
  * - Allocator that should return exactly one zeroed page, if this
- *   allocator fails, trans_pgd_create_copy() and trans_pgd_map_page()
+ *   allocator fails, trans_pgd_create_copy() and trans_pgd_idmap_page()
  *   return -ENOMEM error.
  *
  * trans_alloc_arg
@@ -30,9 +30,6 @@ struct trans_pgd_info {
 int trans_pgd_create_copy(struct trans_pgd_info *info, pgd_t **trans_pgd,
  unsigned long start, unsigned long end);
 
-int trans_pgd_map_page(struct trans_pgd_info *info, pgd_t *trans_pgd,
-  void *page, unsigned long dst_addr, pgprot_t pgprot);
-
 int trans_pgd_idmap_page(struct trans_pgd_info *info, phys_addr_t *trans_ttbr0,
 unsigned long *t0sz, void *page);
 
diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 61549451ed3a..e24a749013c1 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -217,63 +217,6 @@ int trans_pgd_create_copy(struct trans_pgd_info *info, 
pgd_t **dst_pgdp,
return rc;
 }
 
-/*
- * Add map entry to trans_pgd for a base-size page at PTE level.
- * info:   contains allocator and its argument
- * trans_pgd:  page table in which new map is added.
- * page:   page to be mapped.
- * dst_addr:   new VA address for the page
- * pgprot: protection for the page.
- *
- * Returns 0 on success, and -ENOMEM on failure.
- */
-int trans_pgd_map_page(struct trans_pgd_info *info, pgd_t *trans_pgd,
-  void *page, unsigned long dst_addr, pgprot_t pgprot)
-{
-   pgd_t *pgdp;
-   p4d_t *p4dp;
-   pud_t *pudp;
-   pmd_t *pmdp;
-   pte_t *ptep;
-
-   pgdp = pgd_offset_pgd(trans_pgd, dst_addr);
-   if (pgd_none(READ_ONCE(*pgdp))) {
-   p4dp = trans_alloc(info);
-   if (!pgdp)
-   return -ENOMEM;
-   pgd_populate(NULL, pgdp, p4dp);
-   }
-
-   p4dp = p4d_offset(pgdp, dst_addr);
-   if (p4d_none(READ_ONCE(*p4dp))) {
-   pudp = trans_alloc(info);
-   if (!pudp)
-   return -ENOMEM;
-   p4d_populate(NULL, p4dp, pudp);
-   }
-
-   pudp = pud_offset(p4dp, dst_addr);
-   if (pud_none(READ_ONCE(*pudp))) {
-   pmdp = trans_alloc(info);
-   if (!pmdp)
-   return -ENOMEM;
-   pud_populate(NULL, pudp, pmdp);
-   }
-
-   pmdp = pmd_offset(pudp, dst_addr);
-   if (pmd_none(READ_ONCE(*pmdp))) {
-   ptep = trans_alloc(info);
-   if (!ptep)
-   return -ENOMEM;
-   pmd_populate_kernel(NULL, pmdp, ptep);
-   }
-
-   ptep = pte_offset_kernel(pmdp, dst_addr);
-   set_pte(ptep, pfn_pte(virt_to_pfn(page), pgprot));
-
-   return 0;
-}
-
 /*
  * The page we want to idmap may be outside the range covered by VA_BITS that
  * can be built using the kernel's p?d_populate() helpers. As a one off, for a
-- 
2.25.1



[PATCH v13 17/18] arm64: kexec: Remove cpu-reset.h

2021-04-07 Thread Pavel Tatashin
This header now contains only cpu_soft_restart(), which is no longer
used directly. So, remove this header, and rename the underlying
assembly helper __cpu_soft_restart() to cpu_soft_restart().

Suggested-by: James Morse 
Signed-off-by: Pavel Tatashin 
---
 arch/arm64/include/asm/kexec.h|  6 ++
 arch/arm64/kernel/cpu-reset.S |  7 +++
 arch/arm64/kernel/cpu-reset.h | 30 --
 arch/arm64/kernel/machine_kexec.c |  6 ++
 4 files changed, 11 insertions(+), 38 deletions(-)
 delete mode 100644 arch/arm64/kernel/cpu-reset.h

diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 5fc87b51f8a9..ee71ae3b93ed 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -90,6 +90,12 @@ static inline void crash_prepare_suspend(void) {}
 static inline void crash_post_resume(void) {}
 #endif
 
+#if defined(CONFIG_KEXEC_CORE)
+void cpu_soft_restart(unsigned long el2_switch, unsigned long entry,
+ unsigned long arg0, unsigned long arg1,
+ unsigned long arg2);
+#endif
+
 #define ARCH_HAS_KIMAGE_ARCH
 
 struct kimage_arch {
diff --git a/arch/arm64/kernel/cpu-reset.S b/arch/arm64/kernel/cpu-reset.S
index 37721eb6f9a1..5d47d6c92634 100644
--- a/arch/arm64/kernel/cpu-reset.S
+++ b/arch/arm64/kernel/cpu-reset.S
@@ -16,8 +16,7 @@
 .pushsection.idmap.text, "awx"
 
 /*
- * __cpu_soft_restart(el2_switch, entry, arg0, arg1, arg2) - Helper for
- * cpu_soft_restart.
+ * cpu_soft_restart(el2_switch, entry, arg0, arg1, arg2)
  *
  * @el2_switch: Flag to indicate a switch to EL2 is needed.
  * @entry: Location to jump to for soft reset.
@@ -29,7 +28,7 @@
  * branch to what would be the reset vector. It must be executed with the
  * flat identity mapping.
  */
-SYM_CODE_START(__cpu_soft_restart)
+SYM_CODE_START(cpu_soft_restart)
/* Clear sctlr_el1 flags. */
mrs x12, sctlr_el1
mov_q   x13, SCTLR_ELx_FLAGS
@@ -51,6 +50,6 @@ SYM_CODE_START(__cpu_soft_restart)
mov x1, x3  // arg1
mov x2, x4  // arg2
br  x8
-SYM_CODE_END(__cpu_soft_restart)
+SYM_CODE_END(cpu_soft_restart)
 
 .popsection
diff --git a/arch/arm64/kernel/cpu-reset.h b/arch/arm64/kernel/cpu-reset.h
deleted file mode 100644
index f6d95512fec6..
--- a/arch/arm64/kernel/cpu-reset.h
+++ /dev/null
@@ -1,30 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * CPU reset routines
- *
- * Copyright (C) 2015 Huawei Futurewei Technologies.
- */
-
-#ifndef _ARM64_CPU_RESET_H
-#define _ARM64_CPU_RESET_H
-
-#include 
-
-void __cpu_soft_restart(unsigned long el2_switch, unsigned long entry,
-   unsigned long arg0, unsigned long arg1, unsigned long arg2);
-
-static inline void __noreturn cpu_soft_restart(unsigned long entry,
-  unsigned long arg0,
-  unsigned long arg1,
-  unsigned long arg2)
-{
-   typeof(__cpu_soft_restart) *restart;
-
-   restart = (void *)__pa_symbol(__cpu_soft_restart);
-
-   cpu_install_idmap();
-   restart(0, entry, arg0, arg1, arg2);
-   unreachable();
-}
-
-#endif
diff --git a/arch/arm64/kernel/machine_kexec.c 
b/arch/arm64/kernel/machine_kexec.c
index a1c9bee0cddd..ef7ba93f2bd6 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -23,8 +23,6 @@
 #include 
 #include 
 
-#include "cpu-reset.h"
-
 /**
  * kexec_image_info - For debugging output.
  */
@@ -197,10 +195,10 @@ void machine_kexec(struct kimage *kimage)
 * In kexec_file case, the kernel starts directly without purgatory.
 */
if (kimage->head & IND_DONE) {
-   typeof(__cpu_soft_restart) *restart;
+   typeof(cpu_soft_restart) *restart;
 
cpu_install_idmap();
-   restart = (void *)__pa_symbol(__cpu_soft_restart);
+   restart = (void *)__pa_symbol(cpu_soft_restart);
restart(is_hyp_callable(), kimage->start, kimage->arch.dtb_mem,
0, 0);
} else {
-- 
2.25.1



[PATCH v13 13/18] arm64: kexec: use ld script for relocation function

2021-04-07 Thread Pavel Tatashin
Currently, the relocation code declares start and end variables
which are used to compute its size.

A better way to do this is to use the linker script instead, and put the
relocation function in its own section.

Signed-off-by: Pavel Tatashin 
---
 arch/arm64/include/asm/sections.h   |  1 +
 arch/arm64/kernel/machine_kexec.c   | 14 ++
 arch/arm64/kernel/relocate_kernel.S | 15 ++-
 arch/arm64/kernel/vmlinux.lds.S | 19 +++
 4 files changed, 28 insertions(+), 21 deletions(-)

diff --git a/arch/arm64/include/asm/sections.h 
b/arch/arm64/include/asm/sections.h
index 2f36b16a5b5d..31e459af89f6 100644
--- a/arch/arm64/include/asm/sections.h
+++ b/arch/arm64/include/asm/sections.h
@@ -20,5 +20,6 @@ extern char __exittext_begin[], __exittext_end[];
 extern char __irqentry_text_start[], __irqentry_text_end[];
 extern char __mmuoff_data_start[], __mmuoff_data_end[];
 extern char __entry_tramp_text_start[], __entry_tramp_text_end[];
+extern char __relocate_new_kernel_start[], __relocate_new_kernel_end[];
 
 #endif /* __ASM_SECTIONS_H */
diff --git a/arch/arm64/kernel/machine_kexec.c 
b/arch/arm64/kernel/machine_kexec.c
index d5940b7889f8..f1451d807708 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -20,14 +20,11 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "cpu-reset.h"
 
-/* Global variables for the arm64_relocate_new_kernel routine. */
-extern const unsigned char arm64_relocate_new_kernel[];
-extern const unsigned long arm64_relocate_new_kernel_size;
-
 /**
  * kexec_image_info - For debugging output.
  */
@@ -157,6 +154,7 @@ static void *kexec_page_alloc(void *arg)
 int machine_kexec_post_load(struct kimage *kimage)
 {
void *reloc_code = page_to_virt(kimage->control_code_page);
+   long reloc_size;
struct trans_pgd_info info = {
.trans_alloc_page   = kexec_page_alloc,
.trans_alloc_arg= kimage,
@@ -177,14 +175,14 @@ int machine_kexec_post_load(struct kimage *kimage)
return rc;
}
 
-   memcpy(reloc_code, arm64_relocate_new_kernel,
-  arm64_relocate_new_kernel_size);
+   reloc_size = __relocate_new_kernel_end - __relocate_new_kernel_start;
+   memcpy(reloc_code, __relocate_new_kernel_start, reloc_size);
kimage->arch.kern_reloc = __pa(reloc_code);
 
/* Flush the reloc_code in preparation for its execution. */
-   __flush_dcache_area(reloc_code, arm64_relocate_new_kernel_size);
+   __flush_dcache_area(reloc_code, reloc_size);
flush_icache_range((uintptr_t)reloc_code, (uintptr_t)reloc_code +
-  arm64_relocate_new_kernel_size);
+  reloc_size);
kexec_list_flush(kimage);
kexec_image_info(kimage);
 
diff --git a/arch/arm64/kernel/relocate_kernel.S 
b/arch/arm64/kernel/relocate_kernel.S
index df023b82544b..7a600ba33ae1 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -15,6 +15,7 @@
 #include 
 #include 
 
+.pushsection".kexec_relocate.text", "ax"
 /*
  * arm64_relocate_new_kernel - Put a 2nd stage image in place and boot it.
  *
@@ -77,16 +78,4 @@ SYM_CODE_START(arm64_relocate_new_kernel)
mov x3, xzr
br  x4  /* Jumps from el1 */
 SYM_CODE_END(arm64_relocate_new_kernel)
-
-.align 3   /* To keep the 64-bit values below naturally aligned. */
-
-.Lcopy_end:
-.org   KEXEC_CONTROL_PAGE_SIZE
-
-/*
- * arm64_relocate_new_kernel_size - Number of bytes to copy to the
- * control_code_page.
- */
-.globl arm64_relocate_new_kernel_size
-arm64_relocate_new_kernel_size:
-   .quad   .Lcopy_end - arm64_relocate_new_kernel
+.popsection
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 7eea7888bb02..0d9d5e6af66f 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -92,6 +93,16 @@ jiffies = jiffies_64;
 #define HIBERNATE_TEXT
 #endif
 
+#ifdef CONFIG_KEXEC_CORE
+#define KEXEC_TEXT \
+   . = ALIGN(SZ_4K);   \
+   __relocate_new_kernel_start = .;\
+   *(.kexec_relocate.text) \
+   __relocate_new_kernel_end = .;
+#else
+#define KEXEC_TEXT
+#endif
+
 #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
 #define TRAMP_TEXT \
. = ALIGN(PAGE_SIZE);   \
@@ -152,6 +163,7 @@ SECTIONS
HYPERVISOR_TEXT
IDMAP_TEXT
HIBERNATE_TEXT
+   KEXEC_TEXT
TRAMP_TEXT
*(.fixup)
*(.gnu.warning)
@@ -336,3 +348,10 @@ ASS

[PATCH v13 16/18] arm64: kexec: remove the pre-kexec PoC maintenance

2021-04-07 Thread Pavel Tatashin
Now that kexec does its relocations with the MMU enabled, we no longer
need to clean the relocation data to the PoC.

Co-developed-by: James Morse 
Signed-off-by: Pavel Tatashin 
---
 arch/arm64/kernel/machine_kexec.c | 40 ---
 1 file changed, 40 deletions(-)

diff --git a/arch/arm64/kernel/machine_kexec.c 
b/arch/arm64/kernel/machine_kexec.c
index d5c8aefc66f3..a1c9bee0cddd 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -76,45 +76,6 @@ int machine_kexec_prepare(struct kimage *kimage)
return 0;
 }
 
-/**
- * kexec_list_flush - Helper to flush the kimage list and source pages to PoC.
- */
-static void kexec_list_flush(struct kimage *kimage)
-{
-   kimage_entry_t *entry;
-
-   __flush_dcache_area(kimage, sizeof(*kimage));
-
-   for (entry = >head; ; entry++) {
-   unsigned int flag;
-   void *addr;
-
-   /* flush the list entries. */
-   __flush_dcache_area(entry, sizeof(kimage_entry_t));
-
-   flag = *entry & IND_FLAGS;
-   if (flag == IND_DONE)
-   break;
-
-   addr = phys_to_virt(*entry & PAGE_MASK);
-
-   switch (flag) {
-   case IND_INDIRECTION:
-   /* Set entry point just before the new list page. */
-   entry = (kimage_entry_t *)addr - 1;
-   break;
-   case IND_SOURCE:
-   /* flush the source pages. */
-   __flush_dcache_area(addr, PAGE_SIZE);
-   break;
-   case IND_DESTINATION:
-   break;
-   default:
-   BUG();
-   }
-   }
-}
-
 /**
  * kexec_segment_flush - Helper to flush the kimage segments to PoC.
  */
@@ -200,7 +161,6 @@ int machine_kexec_post_load(struct kimage *kimage)
__flush_dcache_area(reloc_code, reloc_size);
flush_icache_range((uintptr_t)reloc_code, (uintptr_t)reloc_code +
   reloc_size);
-   kexec_list_flush(kimage);
kexec_image_info(kimage);
 
return 0;
-- 
2.25.1



[PATCH v13 15/18] arm64: kexec: keep MMU enabled during kexec relocation

2021-04-07 Thread Pavel Tatashin
Now that we have the linear map page tables configured, keep the MMU
enabled to allow faster relocation of segments to their final destination.


Cavium ThunderX2:
Kernel Image size: 38M Initramfs size: 46M Total relocation size: 84M
MMU-disabled:
relocation  7.489539915s
MMU-enabled:
relocation  0.03946095s

Broadcom Stingray:
For a moderate size kernel + initramfs (25M) the relocation was taking
0.382s; with the MMU enabled it now takes only 0.019s, a 20x
improvement.

The time is proportional to the size of the relocation; with a larger
initramfs (e.g. 100M) it could take over a second.
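
Back-of-the-envelope, from the numbers above: ThunderX2 moves ~84M in
7.49s with the MMU off (~11 MB/s) versus ~0.039s with it on (~2.1 GB/s),
roughly a 190x speedup; Stingray's 25M drops from 0.382s to 0.019s, the
~20x quoted above.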

Signed-off-by: Pavel Tatashin 
---
 arch/arm64/include/asm/kexec.h  |  3 +++
 arch/arm64/kernel/asm-offsets.c |  1 +
 arch/arm64/kernel/machine_kexec.c   | 16 ++
 arch/arm64/kernel/relocate_kernel.S | 33 +++--
 4 files changed, 38 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 59ac166daf53..5fc87b51f8a9 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -97,8 +97,11 @@ struct kimage_arch {
phys_addr_t dtb_mem;
phys_addr_t kern_reloc;
phys_addr_t el2_vectors;
+   phys_addr_t ttbr0;
phys_addr_t ttbr1;
phys_addr_t zero_page;
+   unsigned long phys_offset;
+   unsigned long t0sz;
/* Core ELF header buffer */
void *elf_headers;
unsigned long elf_headers_mem;
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 609362b5aa76..ec7bb80aedc8 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -159,6 +159,7 @@ int main(void)
   DEFINE(KIMAGE_ARCH_DTB_MEM,  offsetof(struct kimage, arch.dtb_mem));
   DEFINE(KIMAGE_ARCH_EL2_VECTORS,  offsetof(struct kimage, 
arch.el2_vectors));
   DEFINE(KIMAGE_ARCH_ZERO_PAGE,offsetof(struct kimage, 
arch.zero_page));
+  DEFINE(KIMAGE_ARCH_PHYS_OFFSET,  offsetof(struct kimage, 
arch.phys_offset));
   DEFINE(KIMAGE_ARCH_TTBR1,offsetof(struct kimage, arch.ttbr1));
   DEFINE(KIMAGE_HEAD,  offsetof(struct kimage, head));
   DEFINE(KIMAGE_START, offsetof(struct kimage, start));
diff --git a/arch/arm64/kernel/machine_kexec.c 
b/arch/arm64/kernel/machine_kexec.c
index c875ef522e53..d5c8aefc66f3 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -190,6 +190,11 @@ int machine_kexec_post_load(struct kimage *kimage)
reloc_size = __relocate_new_kernel_end - __relocate_new_kernel_start;
memcpy(reloc_code, __relocate_new_kernel_start, reloc_size);
kimage->arch.kern_reloc = __pa(reloc_code);
+   rc = trans_pgd_idmap_page(, >arch.ttbr0,
+ >arch.t0sz, reloc_code);
+   if (rc)
+   return rc;
+   kimage->arch.phys_offset = virt_to_phys(kimage) - (long)kimage;
 
/* Flush the reloc_code in preparation for its execution. */
__flush_dcache_area(reloc_code, reloc_size);
@@ -223,9 +228,9 @@ void machine_kexec(struct kimage *kimage)
local_daif_mask();
 
/*
-* Both restart and cpu_soft_restart will shutdown the MMU, disable data
+* Both restart and kernel_reloc will shutdown the MMU, disable data
 * caches. However, restart will start new kernel or purgatory directly,
-* cpu_soft_restart will transfer control to arm64_relocate_new_kernel
+* kernel_reloc contains the body of arm64_relocate_new_kernel
 * In kexec case, kimage->start points to purgatory assuming that
 * kernel entry and dtb address are embedded in purgatory by
 * userspace (kexec-tools).
@@ -239,10 +244,13 @@ void machine_kexec(struct kimage *kimage)
restart(is_hyp_callable(), kimage->start, kimage->arch.dtb_mem,
0, 0);
} else {
+   void (*kernel_reloc)(struct kimage *kimage);
+
if (is_hyp_callable())
__hyp_set_vectors(kimage->arch.el2_vectors);
-   cpu_soft_restart(kimage->arch.kern_reloc,
-virt_to_phys(kimage), 0, 0);
+   cpu_install_ttbr0(kimage->arch.ttbr0, kimage->arch.t0sz);
+   kernel_reloc = (void *)kimage->arch.kern_reloc;
+   kernel_reloc(kimage);
}
 
BUG(); /* Should never get here. */
diff --git a/arch/arm64/kernel/relocate_kernel.S 
b/arch/arm64/kernel/relocate_kernel.S
index e83b6380907d..433a57b3d76e 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -4,6 +4,8 @@
  *
  * Copyright (C) Linaro.
  * Copyright (C) Huawei Futurewei Technologies.
+ * Copyright (C) 2020, Microsoft Corporation.
+ * Pavel Tatashin 
  */
 
 #include 
@@ -15,6 +17,15 @@
 #include 
 #include 
 
+.macro turn_off_mmu tmp1, t

[PATCH v13 12/18] arm64: kexec: relocate in EL1 mode

2021-04-07 Thread Pavel Tatashin
Since we are going to keep the MMU enabled during relocation, we need
to stay in EL1 mode throughout the relocation.

Stay in EL1, and switch to EL2 only just before entering the new world.

Suggested-by: James Morse 
Signed-off-by: Pavel Tatashin 
---
 arch/arm64/kernel/cpu-reset.h   |  3 +--
 arch/arm64/kernel/machine_kexec.c   |  4 ++--
 arch/arm64/kernel/relocate_kernel.S | 13 +++--
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/kernel/cpu-reset.h b/arch/arm64/kernel/cpu-reset.h
index 1922e7a690f8..f6d95512fec6 100644
--- a/arch/arm64/kernel/cpu-reset.h
+++ b/arch/arm64/kernel/cpu-reset.h
@@ -20,11 +20,10 @@ static inline void __noreturn cpu_soft_restart(unsigned 
long entry,
 {
typeof(__cpu_soft_restart) *restart;
 
-   unsigned long el2_switch = is_hyp_callable();
restart = (void *)__pa_symbol(__cpu_soft_restart);
 
cpu_install_idmap();
-   restart(el2_switch, entry, arg0, arg1, arg2);
+   restart(0, entry, arg0, arg1, arg2);
unreachable();
 }
 
diff --git a/arch/arm64/kernel/machine_kexec.c 
b/arch/arm64/kernel/machine_kexec.c
index fb03b6676fb9..d5940b7889f8 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -231,8 +231,8 @@ void machine_kexec(struct kimage *kimage)
} else {
if (is_hyp_callable())
__hyp_set_vectors(kimage->arch.el2_vectors);
-   cpu_soft_restart(kimage->arch.kern_reloc, virt_to_phys(kimage),
-0, 0);
+   cpu_soft_restart(kimage->arch.kern_reloc,
+virt_to_phys(kimage), 0, 0);
}
 
BUG(); /* Should never get here. */
diff --git a/arch/arm64/kernel/relocate_kernel.S 
b/arch/arm64/kernel/relocate_kernel.S
index 36b4496524c3..df023b82544b 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * arm64_relocate_new_kernel - Put a 2nd stage image in place and boot it.
@@ -61,12 +62,20 @@ SYM_CODE_START(arm64_relocate_new_kernel)
isb
 
/* Start new image. */
+   ldr x1, [x0, #KIMAGE_ARCH_EL2_VECTORS]  /* relocation start */
+   cbz x1, .Lel1
+   ldr x1, [x0, #KIMAGE_START] /* relocation start */
+   ldr x2, [x0, #KIMAGE_ARCH_DTB_MEM]  /* dtb address */
+   mov x3, xzr
+   mov x4, xzr
+   mov x0, #HVC_SOFT_RESTART
+   hvc #0  /* Jumps from el2 */
+.Lel1:
ldr x4, [x0, #KIMAGE_START] /* relocation start */
ldr x0, [x0, #KIMAGE_ARCH_DTB_MEM]  /* dtb address */
-   mov x1, xzr
mov x2, xzr
mov x3, xzr
-   br  x4
+   br  x4  /* Jumps from el1 */
 SYM_CODE_END(arm64_relocate_new_kernel)
 
 .align 3   /* To keep the 64-bit values below naturally aligned. */
-- 
2.25.1



[PATCH v13 11/18] arm64: kexec: kexec may require EL2 vectors

2021-04-07 Thread Pavel Tatashin
If we have EL2 without VHE, the EL2 vectors are needed in order to
switch to EL2 and jump to the new world with hypervisor privileges.

In preparation for MMU-enabled relocation, configure our EL2 vector
table now.
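
The copy is built at image-load time, using kexec control pages as the
trans_pgd allocator so the memory stays reserved, and it is consumed at
reboot time. A condensed sketch of the consumer side (it mirrors the
machine_kexec() hunk further down; the wrapper name is made up):

#include <linux/kexec.h>
#include <asm/virt.h>

/*
 * Condensed sketch of the consumer side: point VBAR_EL2 at the safe
 * copy so the later switch to EL2 uses vectors that survive relocation.
 * The wrapper name is illustrative only.
 */
static void kexec_load_el2_vectors(struct kimage *kimage)
{
        if (is_hyp_callable())
                __hyp_set_vectors(kimage->arch.el2_vectors);
}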

Suggested-by: James Morse 
Signed-off-by: Pavel Tatashin 
---
 arch/arm64/Kconfig|  2 +-
 arch/arm64/include/asm/kexec.h|  1 +
 arch/arm64/kernel/asm-offsets.c   |  1 +
 arch/arm64/kernel/machine_kexec.c | 31 +++
 4 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index e4e1b6550115..0e876d980a1f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1149,7 +1149,7 @@ config CRASH_DUMP
 
 config TRANS_TABLE
def_bool y
-   depends on HIBERNATION
+   depends on HIBERNATION || KEXEC_CORE
 
 config XEN_DOM0
def_bool y
diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 9befcd87e9a8..305cf0840ed3 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -96,6 +96,7 @@ struct kimage_arch {
void *dtb;
phys_addr_t dtb_mem;
phys_addr_t kern_reloc;
+   phys_addr_t el2_vectors;
/* Core ELF header buffer */
void *elf_headers;
unsigned long elf_headers_mem;
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 0c92e193f866..2e3278df1fc3 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -157,6 +157,7 @@ int main(void)
 #endif
 #ifdef CONFIG_KEXEC_CORE
   DEFINE(KIMAGE_ARCH_DTB_MEM,  offsetof(struct kimage, arch.dtb_mem));
+  DEFINE(KIMAGE_ARCH_EL2_VECTORS,  offsetof(struct kimage, 
arch.el2_vectors));
   DEFINE(KIMAGE_HEAD,  offsetof(struct kimage, head));
   DEFINE(KIMAGE_START, offsetof(struct kimage, start));
   BLANK();
diff --git a/arch/arm64/kernel/machine_kexec.c 
b/arch/arm64/kernel/machine_kexec.c
index 2e734e4ae12e..fb03b6676fb9 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "cpu-reset.h"
 
@@ -42,7 +43,9 @@ static void _kexec_image_info(const char *func, int line,
pr_debug("start:   %lx\n", kimage->start);
pr_debug("head:%lx\n", kimage->head);
pr_debug("nr_segments: %lu\n", kimage->nr_segments);
+   pr_debug("dtb_mem: %pa\n", >arch.dtb_mem);
pr_debug("kern_reloc: %pa\n", >arch.kern_reloc);
+   pr_debug("el2_vectors: %pa\n", >arch.el2_vectors);
 
for (i = 0; i < kimage->nr_segments; i++) {
pr_debug("  segment[%lu]: %016lx - %016lx, 0x%lx bytes, %lu 
pages\n",
@@ -137,9 +140,27 @@ static void kexec_segment_flush(const struct kimage 
*kimage)
}
 }
 
+/* Allocates pages for kexec page table */
+static void *kexec_page_alloc(void *arg)
+{
+   struct kimage *kimage = (struct kimage *)arg;
+   struct page *page = kimage_alloc_control_pages(kimage, 0);
+
+   if (!page)
+   return NULL;
+
+   memset(page_address(page), 0, PAGE_SIZE);
+
+   return page_address(page);
+}
+
 int machine_kexec_post_load(struct kimage *kimage)
 {
void *reloc_code = page_to_virt(kimage->control_code_page);
+   struct trans_pgd_info info = {
+   .trans_alloc_page   = kexec_page_alloc,
+   .trans_alloc_arg= kimage,
+   };
 
/* If in place, relocation is not used, only flush next kernel */
if (kimage->head & IND_DONE) {
@@ -148,6 +169,14 @@ int machine_kexec_post_load(struct kimage *kimage)
return 0;
}
 
+   kimage->arch.el2_vectors = 0;
+   if (is_hyp_callable()) {
+   int rc = trans_pgd_copy_el2_vectors(,
+   >arch.el2_vectors);
+   if (rc)
+   return rc;
+   }
+
memcpy(reloc_code, arm64_relocate_new_kernel,
   arm64_relocate_new_kernel_size);
kimage->arch.kern_reloc = __pa(reloc_code);
@@ -200,6 +229,8 @@ void machine_kexec(struct kimage *kimage)
restart(is_hyp_callable(), kimage->start, kimage->arch.dtb_mem,
0, 0);
} else {
+   if (is_hyp_callable())
+   __hyp_set_vectors(kimage->arch.el2_vectors);
cpu_soft_restart(kimage->arch.kern_reloc, virt_to_phys(kimage),
 0, 0);
}
-- 
2.25.1



[PATCH v13 10/18] arm64: kexec: pass kimage as the only argument to relocation function

2021-04-07 Thread Pavel Tatashin
Currently, the kexec relocation function (arm64_relocate_new_kernel) accepts
the following arguments:

head:   start of array that contains relocation information.
entry:  entry point for new kernel or purgatory.
dtb_mem:first and only argument to entry.

The number of arguments cannot easily be expanded, because this
function is also called from HVC_SOFT_RESTART, which preserves only
three arguments. Also, arm64_relocate_new_kernel is written in
assembly and called without a stack, so there is no place to spill
extra arguments into free registers.

Soon we will need to pass more arguments: once we enable the MMU we
will need to pass information about the page tables.

Pass kimage to arm64_relocate_new_kernel, and teach it to get the
required fields from kimage.
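
For readers who have not met asm-offsets before: asm-offsets.c is
compiled to assembly, and each DEFINE() plants a marker that Kbuild
scrapes into include/generated/asm-offsets.h, turning the structure
offsets into plain #defines that relocate_kernel.S can use. A simplified
sketch of the mechanism:

/* Simplified sketch of the kbuild DEFINE() trick (see include/linux/kbuild.h). */
#define DEFINE(sym, val) \
        asm volatile("\n.ascii \"->" #sym " %0 " #val "\"" : : "i" (val))

/*
 * Kbuild post-processes the generated assembly into
 * include/generated/asm-offsets.h, so relocate_kernel.S ends up seeing
 * ordinary constants, e.g. KIMAGE_HEAD == offsetof(struct kimage, head),
 * which is what "ldr x16, [x0, #KIMAGE_HEAD]" relies on.
 */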

Suggested-by: James Morse 
Signed-off-by: Pavel Tatashin 
---
 arch/arm64/kernel/asm-offsets.c |  7 +++
 arch/arm64/kernel/machine_kexec.c   |  6 --
 arch/arm64/kernel/relocate_kernel.S | 10 --
 3 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index a36e2fc330d4..0c92e193f866 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -9,6 +9,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -153,6 +154,12 @@ int main(void)
   DEFINE(PTRAUTH_USER_KEY_APGA,offsetof(struct 
ptrauth_keys_user, apga));
   DEFINE(PTRAUTH_KERNEL_KEY_APIA,  offsetof(struct ptrauth_keys_kernel, 
apia));
   BLANK();
+#endif
+#ifdef CONFIG_KEXEC_CORE
+  DEFINE(KIMAGE_ARCH_DTB_MEM,  offsetof(struct kimage, arch.dtb_mem));
+  DEFINE(KIMAGE_HEAD,  offsetof(struct kimage, head));
+  DEFINE(KIMAGE_START, offsetof(struct kimage, start));
+  BLANK();
 #endif
   return 0;
 }
diff --git a/arch/arm64/kernel/machine_kexec.c 
b/arch/arm64/kernel/machine_kexec.c
index b150b65f0b84..2e734e4ae12e 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -83,6 +83,8 @@ static void kexec_list_flush(struct kimage *kimage)
 {
kimage_entry_t *entry;
 
+   __flush_dcache_area(kimage, sizeof(*kimage));
+
for (entry = >head; ; entry++) {
unsigned int flag;
void *addr;
@@ -198,8 +200,8 @@ void machine_kexec(struct kimage *kimage)
restart(is_hyp_callable(), kimage->start, kimage->arch.dtb_mem,
0, 0);
} else {
-   cpu_soft_restart(kimage->arch.kern_reloc, kimage->head,
-kimage->start, kimage->arch.dtb_mem);
+   cpu_soft_restart(kimage->arch.kern_reloc, virt_to_phys(kimage),
+0, 0);
}
 
BUG(); /* Should never get here. */
diff --git a/arch/arm64/kernel/relocate_kernel.S 
b/arch/arm64/kernel/relocate_kernel.S
index 718037bef560..36b4496524c3 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -27,9 +27,7 @@
  */
 SYM_CODE_START(arm64_relocate_new_kernel)
/* Setup the list loop variables. */
-   mov x18, x2 /* x18 = dtb address */
-   mov x17, x1 /* x17 = kimage_start */
-   mov x16, x0 /* x16 = kimage_head */
+   ldr x16, [x0, #KIMAGE_HEAD] /* x16 = kimage_head */
mov x14, xzr/* x14 = entry ptr */
mov x13, xzr/* x13 = copy dest */
raw_dcache_line_size x15, x1/* x15 = dcache line size */
@@ -63,12 +61,12 @@ SYM_CODE_START(arm64_relocate_new_kernel)
isb
 
/* Start new image. */
-   mov x0, x18
+   ldr x4, [x0, #KIMAGE_START] /* relocation start */
+   ldr x0, [x0, #KIMAGE_ARCH_DTB_MEM]  /* dtb address */
mov x1, xzr
mov x2, xzr
mov x3, xzr
-   br  x17
-
+   br  x4
 SYM_CODE_END(arm64_relocate_new_kernel)
 
 .align 3   /* To keep the 64-bit values below naturally aligned. */
-- 
2.25.1



[PATCH v13 09/18] arm64: kexec: Use dcache ops macros instead of open-coding

2021-04-07 Thread Pavel Tatashin
From: James Morse 

kexec does dcache maintenance when it re-writes all memory. Our
dcache_by_line_op macro depends on reading the sanitised DminLine
from memory. Kexec may have overwritten this, so it open-codes the
sequence.

dcache_by_line_op is a whole set of macros; it uses dcache_line_size,
which uses read_ctr for the sanitised DminLine. Reading the DminLine
is the first thing dcache_by_line_op does.

Rename dcache_by_line_op to dcache_by_myline_op and take DminLine as
an argument. Kexec can now use the slightly smaller macro.

This makes upcoming changes to the dcache maintenance easier on
the eye.

Code generated by the existing callers is unchanged.

Signed-off-by: James Morse 

[Fixed merging issues]

Signed-off-by: Pavel Tatashin 
---
 arch/arm64/include/asm/assembler.h  | 12 
 arch/arm64/kernel/relocate_kernel.S | 13 +++--
 2 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/include/asm/assembler.h 
b/arch/arm64/include/asm/assembler.h
index ca31594d3d6c..29061b76aab6 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -371,10 +371,9 @@ alternative_else
 alternative_endif
.endm
 
-   .macro dcache_by_line_op op, domain, kaddr, size, tmp1, tmp2
-   dcache_line_size \tmp1, \tmp2
+   .macro dcache_by_myline_op op, domain, kaddr, size, linesz, tmp2
add \size, \kaddr, \size
-   sub \tmp2, \tmp1, #1
+   sub \tmp2, \linesz, #1
bic \kaddr, \kaddr, \tmp2
 9998:
.ifc\op, cvau
@@ -394,12 +393,17 @@ alternative_endif
.endif
.endif
.endif
-   add \kaddr, \kaddr, \tmp1
+   add \kaddr, \kaddr, \linesz
cmp \kaddr, \size
b.lo9998b
dsb \domain
.endm
 
+   .macro dcache_by_line_op op, domain, kaddr, size, tmp1, tmp2
+   dcache_line_size \tmp1, \tmp2
+   dcache_by_myline_op \op, \domain, \kaddr, \size, \tmp1, \tmp2
+   .endm
+
 /*
  * Macro to perform an instruction cache maintenance for the interval
  * [start, end)
diff --git a/arch/arm64/kernel/relocate_kernel.S 
b/arch/arm64/kernel/relocate_kernel.S
index 8058fabe0a76..718037bef560 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -41,16 +41,9 @@ SYM_CODE_START(arm64_relocate_new_kernel)
tbz x16, IND_SOURCE_BIT, .Ltest_indirection
 
/* Invalidate dest page to PoC. */
-   mov x2, x13
-   add x20, x2, #PAGE_SIZE
-   sub x1, x15, #1
-   bic x2, x2, x1
-2: dc  ivac, x2
-   add x2, x2, x15
-   cmp x2, x20
-   b.lo2b
-   dsb sy
-
+   mov x2, x13
+   mov x1, #PAGE_SIZE
+   dcache_by_myline_op ivac, sy, x2, x1, x15, x20
copy_page x13, x12, x1, x2, x3, x4, x5, x6, x7, x8
b   .Lnext
 .Ltest_indirection:
-- 
2.25.1



[PATCH v13 07/18] arm64: kexec: flush image and lists during kexec load time

2021-04-07 Thread Pavel Tatashin
Currently, during kexec load we copy the relocation function and flush
it. However, we can also flush the kexec relocation buffers and, if the
new kernel image is already in place (i.e. crash kernel), we can also
flush the new kernel image itself.

Signed-off-by: Pavel Tatashin 
---
 arch/arm64/kernel/machine_kexec.c | 49 +++
 1 file changed, 23 insertions(+), 26 deletions(-)

diff --git a/arch/arm64/kernel/machine_kexec.c 
b/arch/arm64/kernel/machine_kexec.c
index 90a335c74442..3a034bc25709 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -59,23 +59,6 @@ void machine_kexec_cleanup(struct kimage *kimage)
/* Empty routine needed to avoid build errors. */
 }
 
-int machine_kexec_post_load(struct kimage *kimage)
-{
-   void *reloc_code = page_to_virt(kimage->control_code_page);
-
-   memcpy(reloc_code, arm64_relocate_new_kernel,
-  arm64_relocate_new_kernel_size);
-   kimage->arch.kern_reloc = __pa(reloc_code);
-   kexec_image_info(kimage);
-
-   /* Flush the reloc_code in preparation for its execution. */
-   __flush_dcache_area(reloc_code, arm64_relocate_new_kernel_size);
-   flush_icache_range((uintptr_t)reloc_code, (uintptr_t)reloc_code +
-  arm64_relocate_new_kernel_size);
-
-   return 0;
-}
-
 /**
  * machine_kexec_prepare - Prepare for a kexec reboot.
  *
@@ -152,6 +135,29 @@ static void kexec_segment_flush(const struct kimage 
*kimage)
}
 }
 
+int machine_kexec_post_load(struct kimage *kimage)
+{
+   void *reloc_code = page_to_virt(kimage->control_code_page);
+
+   /* If in place flush new kernel image, else flush lists and buffers */
+   if (kimage->head & IND_DONE)
+   kexec_segment_flush(kimage);
+   else
+   kexec_list_flush(kimage);
+
+   memcpy(reloc_code, arm64_relocate_new_kernel,
+  arm64_relocate_new_kernel_size);
+   kimage->arch.kern_reloc = __pa(reloc_code);
+   kexec_image_info(kimage);
+
+   /* Flush the reloc_code in preparation for its execution. */
+   __flush_dcache_area(reloc_code, arm64_relocate_new_kernel_size);
+   flush_icache_range((uintptr_t)reloc_code, (uintptr_t)reloc_code +
+  arm64_relocate_new_kernel_size);
+
+   return 0;
+}
+
 /**
  * machine_kexec - Do the kexec reboot.
  *
@@ -169,13 +175,6 @@ void machine_kexec(struct kimage *kimage)
WARN(in_kexec_crash && (stuck_cpus || smp_crash_stop_failed()),
"Some CPUs may be stale, kdump will be unreliable.\n");
 
-   /* Flush the kimage list and its buffers. */
-   kexec_list_flush(kimage);
-
-   /* Flush the new image if already in place. */
-   if ((kimage != kexec_crash_image) && (kimage->head & IND_DONE))
-   kexec_segment_flush(kimage);
-
pr_info("Bye!\n");
 
local_daif_mask();
@@ -250,8 +249,6 @@ void arch_kexec_protect_crashkres(void)
 {
int i;
 
-   kexec_segment_flush(kexec_crash_image);
-
for (i = 0; i < kexec_crash_image->nr_segments; i++)
set_memory_valid(
__phys_to_virt(kexec_crash_image->segment[i].mem),
-- 
2.25.1



[PATCH v13 08/18] arm64: kexec: skip relocation code for inplace kexec

2021-04-07 Thread Pavel Tatashin
In the case of kdump, or when segments are already in place, relocation
is not needed; therefore, the setup of the relocation function and the
call to it can be skipped.

Signed-off-by: Pavel Tatashin 
Suggested-by: James Morse 
---
 arch/arm64/kernel/machine_kexec.c   | 34 ++---
 arch/arm64/kernel/relocate_kernel.S |  3 ---
 2 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/kernel/machine_kexec.c 
b/arch/arm64/kernel/machine_kexec.c
index 3a034bc25709..b150b65f0b84 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -139,21 +139,23 @@ int machine_kexec_post_load(struct kimage *kimage)
 {
void *reloc_code = page_to_virt(kimage->control_code_page);
 
-   /* If in place flush new kernel image, else flush lists and buffers */
-   if (kimage->head & IND_DONE)
+   /* If in place, relocation is not used, only flush next kernel */
+   if (kimage->head & IND_DONE) {
kexec_segment_flush(kimage);
-   else
-   kexec_list_flush(kimage);
+   kexec_image_info(kimage);
+   return 0;
+   }
 
memcpy(reloc_code, arm64_relocate_new_kernel,
   arm64_relocate_new_kernel_size);
kimage->arch.kern_reloc = __pa(reloc_code);
-   kexec_image_info(kimage);
 
/* Flush the reloc_code in preparation for its execution. */
__flush_dcache_area(reloc_code, arm64_relocate_new_kernel_size);
flush_icache_range((uintptr_t)reloc_code, (uintptr_t)reloc_code +
   arm64_relocate_new_kernel_size);
+   kexec_list_flush(kimage);
+   kexec_image_info(kimage);
 
return 0;
 }
@@ -180,19 +182,25 @@ void machine_kexec(struct kimage *kimage)
local_daif_mask();
 
/*
-* cpu_soft_restart will shutdown the MMU, disable data caches, then
-* transfer control to the kern_reloc which contains a copy of
-* the arm64_relocate_new_kernel routine.  arm64_relocate_new_kernel
-* uses physical addressing to relocate the new image to its final
-* position and transfers control to the image entry point when the
-* relocation is complete.
+* Both restart and cpu_soft_restart will shutdown the MMU, disable data
+* caches. However, restart will start new kernel or purgatory directly,
+* cpu_soft_restart will transfer control to arm64_relocate_new_kernel
 * In kexec case, kimage->start points to purgatory assuming that
 * kernel entry and dtb address are embedded in purgatory by
 * userspace (kexec-tools).
 * In kexec_file case, the kernel starts directly without purgatory.
 */
-   cpu_soft_restart(kimage->arch.kern_reloc, kimage->head, kimage->start,
-kimage->arch.dtb_mem);
+   if (kimage->head & IND_DONE) {
+   typeof(__cpu_soft_restart) *restart;
+
+   cpu_install_idmap();
+   restart = (void *)__pa_symbol(__cpu_soft_restart);
+   restart(is_hyp_callable(), kimage->start, kimage->arch.dtb_mem,
+   0, 0);
+   } else {
+   cpu_soft_restart(kimage->arch.kern_reloc, kimage->head,
+kimage->start, kimage->arch.dtb_mem);
+   }
 
BUG(); /* Should never get here. */
 }
diff --git a/arch/arm64/kernel/relocate_kernel.S 
b/arch/arm64/kernel/relocate_kernel.S
index b78ea5de97a4..8058fabe0a76 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -32,8 +32,6 @@ SYM_CODE_START(arm64_relocate_new_kernel)
mov x16, x0 /* x16 = kimage_head */
mov x14, xzr/* x14 = entry ptr */
mov x13, xzr/* x13 = copy dest */
-   /* Check if the new image needs relocation. */
-   tbnzx16, IND_DONE_BIT, .Ldone
raw_dcache_line_size x15, x1/* x15 = dcache line size */
 .Lloop:
and x12, x16, PAGE_MASK /* x12 = addr */
@@ -65,7 +63,6 @@ SYM_CODE_START(arm64_relocate_new_kernel)
 .Lnext:
ldr x16, [x14], #8  /* entry = *ptr++ */
tbz x16, IND_DONE_BIT, .Lloop   /* while (!(entry & DONE)) */
-.Ldone:
/* wait for writes from copy_page to finish */
dsb nsh
ic  iallu
-- 
2.25.1



[PATCH v13 06/18] arm64: hibernate: abstract ttbr0 setup function

2021-04-07 Thread Pavel Tatashin
Currently, only hibernate sets a custom ttbr0 with a safe idmapped
function. Kexec is also going to use this functionality when the
relocation code is idmapped.

Move the setup sequence to a dedicated cpu_install_ttbr0() helper for
installing a custom ttbr0.
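
A minimal sketch of the intended kexec-side use, assuming the ttbr0/t0sz
fields that a later patch in this series adds to struct kimage_arch (the
wrapper name and __noreturn shape are made up for illustration):

#include <linux/kexec.h>
#include <asm/mmu_context.h>

/*
 * Sketch: install the idmapped TTBR0 that maps the relocation code,
 * then enter the relocation code with the MMU left on.
 */
static void __noreturn kexec_enter_relocation(struct kimage *kimage)
{
        void (*kernel_reloc)(struct kimage *kimage);

        cpu_install_ttbr0(kimage->arch.ttbr0, kimage->arch.t0sz);

        kernel_reloc = (void *)kimage->arch.kern_reloc;
        kernel_reloc(kimage);           /* does not return */
        unreachable();
}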

Suggested-by: James Morse 
Signed-off-by: Pavel Tatashin 
---
 arch/arm64/include/asm/mmu_context.h | 24 
 arch/arm64/kernel/hibernate.c| 21 +
 2 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/include/asm/mmu_context.h 
b/arch/arm64/include/asm/mmu_context.h
index bd02e99b1a4c..f64d0d5e1b1f 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -115,6 +115,30 @@ static inline void cpu_install_idmap(void)
cpu_switch_mm(lm_alias(idmap_pg_dir), _mm);
 }
 
+/*
+ * Load our new page tables. A strict BBM approach requires that we ensure that
+ * TLBs are free of any entries that may overlap with the global mappings we 
are
+ * about to install.
+ *
+ * For a real hibernate/resume/kexec cycle TTBR0 currently points to a zero
+ * page, but TLBs may contain stale ASID-tagged entries (e.g. for EFI runtime
+ * services), while for a userspace-driven test_resume cycle it points to
+ * userspace page tables (and we must point it at a zero page ourselves).
+ *
+ * We change T0SZ as part of installing the idmap. This is undone by
+ * cpu_uninstall_idmap() in __cpu_suspend_exit().
+ */
+static inline void cpu_install_ttbr0(phys_addr_t ttbr0, unsigned long t0sz)
+{
+   cpu_set_reserved_ttbr0();
+   local_flush_tlb_all();
+   __cpu_set_tcr_t0sz(t0sz);
+
+   /* avoid cpu_switch_mm() and its SW-PAN and CNP interactions */
+   write_sysreg(ttbr0, ttbr0_el1);
+   isb();
+}
+
 /*
  * Atomically replaces the active TTBR1_EL1 PGD with a new VA-compatible PGD,
  * avoiding the possibility of conflicting TLB entries being allocated.
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 0b8bad8bb6eb..ded5115bcb63 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -206,26 +206,7 @@ static int create_safe_exec_page(void *src_start, size_t 
length,
if (rc)
return rc;
 
-   /*
-* Load our new page tables. A strict BBM approach requires that we
-* ensure that TLBs are free of any entries that may overlap with the
-* global mappings we are about to install.
-*
-* For a real hibernate/resume cycle TTBR0 currently points to a zero
-* page, but TLBs may contain stale ASID-tagged entries (e.g. for EFI
-* runtime services), while for a userspace-driven test_resume cycle it
-* points to userspace page tables (and we must point it at a zero page
-* ourselves).
-*
-* We change T0SZ as part of installing the idmap. This is undone by
-* cpu_uninstall_idmap() in __cpu_suspend_exit().
-*/
-   cpu_set_reserved_ttbr0();
-   local_flush_tlb_all();
-   __cpu_set_tcr_t0sz(t0sz);
-   write_sysreg(trans_ttbr0, ttbr0_el1);
-   isb();
-
+   cpu_install_ttbr0(trans_ttbr0, t0sz);
*phys_dst_addr = virt_to_phys(page);
 
return 0;
-- 
2.25.1



[PATCH v13 05/18] arm64: trans_pgd: hibernate: Add trans_pgd_copy_el2_vectors

2021-04-07 Thread Pavel Tatashin
Users of trans_pgd may also need a copy of the vector table, because it
may also be overwritten when the linear map is overwritten.

Move setup of EL2 vectors from hibernate to trans_pgd, so it can be
later shared with kexec as well.
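
A condensed usage sketch, based on the hibernate.c changes in this patch
(in the real code the copy is made early and installed later, just before
hibernate_exit(); the wrapper below merges the two steps purely for
illustration):

#include <asm/trans_pgd.h>
#include <asm/virt.h>

/* Sketch: make a safe copy of the EL2 vectors and install it. */
static int install_safe_el2_vectors(struct trans_pgd_info *info)
{
        phys_addr_t el2_vectors;
        int rc;

        if (!is_hyp_callable())
                return 0;       /* booted at EL1, or running with VHE */

        rc = trans_pgd_copy_el2_vectors(info, &el2_vectors);
        if (rc)
                return rc;

        __hyp_set_vectors(el2_vectors);
        return 0;
}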

Suggested-by: James Morse 
Signed-off-by: Pavel Tatashin 
---
 arch/arm64/include/asm/trans_pgd.h |  3 +++
 arch/arm64/include/asm/virt.h  |  3 +++
 arch/arm64/kernel/hibernate.c  | 28 ++--
 arch/arm64/mm/trans_pgd.c  | 20 
 4 files changed, 36 insertions(+), 18 deletions(-)

diff --git a/arch/arm64/include/asm/trans_pgd.h 
b/arch/arm64/include/asm/trans_pgd.h
index 5d08e5adf3d5..e0760e52d36d 100644
--- a/arch/arm64/include/asm/trans_pgd.h
+++ b/arch/arm64/include/asm/trans_pgd.h
@@ -36,4 +36,7 @@ int trans_pgd_map_page(struct trans_pgd_info *info, pgd_t 
*trans_pgd,
 int trans_pgd_idmap_page(struct trans_pgd_info *info, phys_addr_t *trans_ttbr0,
 unsigned long *t0sz, void *page);
 
+int trans_pgd_copy_el2_vectors(struct trans_pgd_info *info,
+  phys_addr_t *el2_vectors);
+
 #endif /* _ASM_TRANS_TABLE_H */
diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h
index 4216c8623538..bfbb66018114 100644
--- a/arch/arm64/include/asm/virt.h
+++ b/arch/arm64/include/asm/virt.h
@@ -67,6 +67,9 @@
  */
 extern u32 __boot_cpu_mode[2];
 
+extern char __hyp_stub_vectors[];
+#define ARM64_VECTOR_TABLE_LEN SZ_2K
+
 void __hyp_set_vectors(phys_addr_t phys_vector_base);
 void __hyp_reset_vectors(void);
 
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index c764574a1acb..0b8bad8bb6eb 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -48,12 +48,6 @@
  */
 extern int in_suspend;
 
-/* temporary el2 vectors in the __hibernate_exit_text section. */
-extern char hibernate_el2_vectors[];
-
-/* hyp-stub vectors, used to restore el2 during resume from hibernate. */
-extern char __hyp_stub_vectors[];
-
 /*
  * The logical cpu number we should resume on, initialised to a non-cpu
  * number.
@@ -428,6 +422,7 @@ int swsusp_arch_resume(void)
void *zero_page;
size_t exit_size;
pgd_t *tmp_pg_dir;
+   phys_addr_t el2_vectors;
void __noreturn (*hibernate_exit)(phys_addr_t, phys_addr_t, void *,
  void *, phys_addr_t, phys_addr_t);
struct trans_pgd_info trans_info = {
@@ -455,6 +450,14 @@ int swsusp_arch_resume(void)
return -ENOMEM;
}
 
+   if (is_hyp_callable()) {
+   rc = trans_pgd_copy_el2_vectors(_info, _vectors);
+   if (rc) {
+   pr_err("Failed to setup el2 vectors\n");
+   return rc;
+   }
+   }
+
exit_size = __hibernate_exit_text_end - __hibernate_exit_text_start;
/*
 * Copy swsusp_arch_suspend_exit() to a safe page. This will generate
@@ -467,25 +470,14 @@ int swsusp_arch_resume(void)
return rc;
}
 
-   /*
-* The hibernate exit text contains a set of el2 vectors, that will
-* be executed at el2 with the mmu off in order to reload hyp-stub.
-*/
-   __flush_dcache_area(hibernate_exit, exit_size);
-
/*
 * KASLR will cause the el2 vectors to be in a different location in
 * the resumed kernel. Load hibernate's temporary copy into el2.
 *
 * We can skip this step if we booted at EL1, or are running with VHE.
 */
-   if (is_hyp_callable()) {
-   phys_addr_t el2_vectors = (phys_addr_t)hibernate_exit;
-   el2_vectors += hibernate_el2_vectors -
-  __hibernate_exit_text_start; /* offset */
-
+   if (is_hyp_callable())
__hyp_set_vectors(el2_vectors);
-   }
 
hibernate_exit(virt_to_phys(tmp_pg_dir), resume_hdr.ttbr1_el1,
   resume_hdr.reenter_kernel, restore_pblist,
diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 527f0a39c3da..61549451ed3a 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -322,3 +322,23 @@ int trans_pgd_idmap_page(struct trans_pgd_info *info, 
phys_addr_t *trans_ttbr0,
 
return 0;
 }
+
+/*
+ * Create a copy of the vector table so we can call HVC_SET_VECTORS or
+ * HVC_SOFT_RESTART from contexts where the table may be overwritten.
+ */
+int trans_pgd_copy_el2_vectors(struct trans_pgd_info *info,
+  phys_addr_t *el2_vectors)
+{
+   void *hyp_stub = trans_alloc(info);
+
+   if (!hyp_stub)
+   return -ENOMEM;
+   *el2_vectors = virt_to_phys(hyp_stub);
+   memcpy(hyp_stub, &__hyp_stub_vectors, ARM64_VECTOR_TABLE_LEN);
+   __flush_icache_range((unsigned long)hyp_stub,
+(unsigned long)hyp_stub + ARM64_VECTOR_TABLE_LEN);
+   __flu

[PATCH v13 04/18] arm64: kernel: add helper for booted at EL2 and not VHE

2021-04-07 Thread Pavel Tatashin
Replace places that contain logic like this:
is_hyp_mode_available() && !is_kernel_in_hyp_mode()

with a dedicated boolean function, is_hyp_callable(). This will be
needed later in kexec in order to switch back to EL2 sooner.

Suggested-by: James Morse 
Signed-off-by: Pavel Tatashin 
---
 arch/arm64/include/asm/virt.h | 5 +
 arch/arm64/kernel/cpu-reset.h | 3 +--
 arch/arm64/kernel/hibernate.c | 9 +++--
 arch/arm64/kernel/sdei.c  | 2 +-
 4 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h
index 7379f35ae2c6..4216c8623538 100644
--- a/arch/arm64/include/asm/virt.h
+++ b/arch/arm64/include/asm/virt.h
@@ -128,6 +128,11 @@ static __always_inline bool is_protected_kvm_enabled(void)
return cpus_have_final_cap(ARM64_KVM_PROTECTED_MODE);
 }
 
+static inline bool is_hyp_callable(void)
+{
+   return is_hyp_mode_available() && !is_kernel_in_hyp_mode();
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* ! __ASM__VIRT_H */
diff --git a/arch/arm64/kernel/cpu-reset.h b/arch/arm64/kernel/cpu-reset.h
index ed50e9587ad8..1922e7a690f8 100644
--- a/arch/arm64/kernel/cpu-reset.h
+++ b/arch/arm64/kernel/cpu-reset.h
@@ -20,8 +20,7 @@ static inline void __noreturn cpu_soft_restart(unsigned long 
entry,
 {
typeof(__cpu_soft_restart) *restart;
 
-   unsigned long el2_switch = !is_kernel_in_hyp_mode() &&
-   is_hyp_mode_available();
+   unsigned long el2_switch = is_hyp_callable();
restart = (void *)__pa_symbol(__cpu_soft_restart);
 
cpu_install_idmap();
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index b1cef371df2b..c764574a1acb 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -48,9 +48,6 @@
  */
 extern int in_suspend;
 
-/* Do we need to reset el2? */
-#define el2_reset_needed() (is_hyp_mode_available() && 
!is_kernel_in_hyp_mode())
-
 /* temporary el2 vectors in the __hibernate_exit_text section. */
 extern char hibernate_el2_vectors[];
 
@@ -125,7 +122,7 @@ int arch_hibernation_header_save(void *addr, unsigned int 
max_size)
hdr->reenter_kernel = _cpu_resume;
 
/* We can't use __hyp_get_vectors() because kvm may still be loaded */
-   if (el2_reset_needed())
+   if (is_hyp_callable())
hdr->__hyp_stub_vectors = __pa_symbol(__hyp_stub_vectors);
else
hdr->__hyp_stub_vectors = 0;
@@ -387,7 +384,7 @@ int swsusp_arch_suspend(void)
dcache_clean_range(__idmap_text_start, __idmap_text_end);
 
/* Clean kvm setup code to PoC? */
-   if (el2_reset_needed()) {
+   if (is_hyp_callable()) {
dcache_clean_range(__hyp_idmap_text_start, 
__hyp_idmap_text_end);
dcache_clean_range(__hyp_text_start, __hyp_text_end);
}
@@ -482,7 +479,7 @@ int swsusp_arch_resume(void)
 *
 * We can skip this step if we booted at EL1, or are running with VHE.
 */
-   if (el2_reset_needed()) {
+   if (is_hyp_callable()) {
phys_addr_t el2_vectors = (phys_addr_t)hibernate_exit;
el2_vectors += hibernate_el2_vectors -
   __hibernate_exit_text_start; /* offset */
diff --git a/arch/arm64/kernel/sdei.c b/arch/arm64/kernel/sdei.c
index 2c7ca449dd51..af0ac2f920cf 100644
--- a/arch/arm64/kernel/sdei.c
+++ b/arch/arm64/kernel/sdei.c
@@ -200,7 +200,7 @@ unsigned long sdei_arch_get_entry_point(int conduit)
 * dropped to EL1 because we don't support VHE, then we can't support
 * SDEI.
 */
-   if (is_hyp_mode_available() && !is_kernel_in_hyp_mode()) {
+   if (is_hyp_callable()) {
pr_err("Not supported on this hardware/boot configuration\n");
goto out_err;
}
-- 
2.25.1



[PATCH v13 03/18] arm64: hyp-stub: Move el1_sync into the vectors

2021-04-07 Thread Pavel Tatashin
From: James Morse 

The hyp-stub's el1_sync code doesn't do very much, this can easily fit
in the vectors.

With this, all of the hyp-stubs behaviour is contained in its vectors.
This lets kexec and hibernate copy the hyp-stub when they need its
behaviour, instead of re-implementing it.
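
For readers who do not want to trace the vector assembly, the dispatch that
el1_sync performs maps roughly onto the following C. This is an illustration
only: hyp_stub_dispatch() and its parameter names are hypothetical, the real
code stays in assembly with the function ID in x0 and the arguments in
x1..x4, and it also handles HVC_VHE_RESTART by branching to mutate_to_vhe.

	/* Illustration only -- hypothetical C rendering of el1_sync. */
	static unsigned long hyp_stub_dispatch(unsigned long x0, unsigned long x1,
					       unsigned long x2, unsigned long x3,
					       unsigned long x4)
	{
		switch (x0) {
		case HVC_SET_VECTORS:	/* install the EL2 vectors passed in x1 */
			write_sysreg(x1, vbar_el2);
			return 0;
		case HVC_SOFT_RESTART:	/* jump to x1 with x2..x4 as arguments */
			((void (*)(unsigned long, unsigned long, unsigned long))x1)(x2, x3, x4);
			unreachable();	/* no return */
		case HVC_RESET_VECTORS:	/* nothing to reset in the stub */
			return 0;
		default:		/* kvm_call_hyp() against the hyp-stub */
			return HVC_STUB_ERR;
		}
	}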

Signed-off-by: James Morse 

[Fixed merging issues]

Signed-off-by: Pavel Tatashin 
---
 arch/arm64/kernel/hyp-stub.S | 59 ++--
 1 file changed, 29 insertions(+), 30 deletions(-)

diff --git a/arch/arm64/kernel/hyp-stub.S b/arch/arm64/kernel/hyp-stub.S
index ff329c5c074d..d1a73d0f74e0 100644
--- a/arch/arm64/kernel/hyp-stub.S
+++ b/arch/arm64/kernel/hyp-stub.S
@@ -21,6 +21,34 @@ SYM_CODE_START_LOCAL(\label)
.align 7
b   \label
 SYM_CODE_END(\label)
+.endm
+
+.macro hyp_stub_el1_sync
+SYM_CODE_START_LOCAL(hyp_stub_el1_sync)
+   .align 7
+   cmp x0, #HVC_SET_VECTORS
+   b.ne2f
+   msr vbar_el2, x1
+   b   9f
+
+2: cmp x0, #HVC_SOFT_RESTART
+   b.ne3f
+   mov x0, x2
+   mov x2, x4
+   mov x4, x1
+   mov x1, x3
+   br  x4  // no return
+
+3: cmp x0, #HVC_RESET_VECTORS
+   beq 9f  // Nothing to reset!
+
+   /* Someone called kvm_call_hyp() against the hyp-stub... */
+   mov_q   x0, HVC_STUB_ERR
+   eret
+
+9: mov x0, xzr
+   eret
+SYM_CODE_END(hyp_stub_el1_sync)
 .endm
 
.text
@@ -39,7 +67,7 @@ SYM_CODE_START(__hyp_stub_vectors)
invalid_vector  hyp_stub_el2h_fiq_invalid   // FIQ EL2h
invalid_vector  hyp_stub_el2h_error_invalid // Error EL2h
 
-   ventry  el1_sync// Synchronous 64-bit EL1
+   hyp_stub_el1_sync   // Synchronous 64-bit 
EL1
invalid_vector  hyp_stub_el1_irq_invalid// IRQ 64-bit EL1
invalid_vector  hyp_stub_el1_fiq_invalid// FIQ 64-bit EL1
invalid_vector  hyp_stub_el1_error_invalid  // Error 64-bit EL1
@@ -55,35 +83,6 @@ SYM_CODE_END(__hyp_stub_vectors)
 # Check the __hyp_stub_vectors didn't overflow
 .org . - (__hyp_stub_vectors_end - __hyp_stub_vectors) + SZ_2K
 
-
-SYM_CODE_START_LOCAL(el1_sync)
-   cmp x0, #HVC_SET_VECTORS
-   b.ne1f
-   msr vbar_el2, x1
-   b   9f
-
-1: cmp x0, #HVC_VHE_RESTART
-   b.eqmutate_to_vhe
-
-2: cmp x0, #HVC_SOFT_RESTART
-   b.ne3f
-   mov x0, x2
-   mov x2, x4
-   mov x4, x1
-   mov x1, x3
-   br  x4  // no return
-
-3: cmp x0, #HVC_RESET_VECTORS
-   beq 9f  // Nothing to reset!
-
-   /* Someone called kvm_call_hyp() against the hyp-stub... */
-   mov_q   x0, HVC_STUB_ERR
-   eret
-
-9: mov x0, xzr
-   eret
-SYM_CODE_END(el1_sync)
-
 // nVHE? No way! Give me the real thing!
 SYM_CODE_START_LOCAL(mutate_to_vhe)
// Sanity check: MMU *must* be off
-- 
2.25.1



[PATCH v13 02/18] arm64: hyp-stub: Move invalid vector entries into the vectors

2021-04-07 Thread Pavel Tatashin
From: James Morse 

Most of the hyp-stub's vector entries are invalid. These are each
a unique function that branches to itself. To move these into the
vectors, merge the ventry and invalid_vector macros and give each
one a unique name.

This means we can copy the hyp-stub, as it is self-contained within
its vectors.

Signed-off-by: James Morse 

[Fixed merging issues]

Signed-off-by: Pavel Tatashin 
---
 arch/arm64/kernel/hyp-stub.S | 56 +++-
 1 file changed, 23 insertions(+), 33 deletions(-)

diff --git a/arch/arm64/kernel/hyp-stub.S b/arch/arm64/kernel/hyp-stub.S
index 572b28646005..ff329c5c074d 100644
--- a/arch/arm64/kernel/hyp-stub.S
+++ b/arch/arm64/kernel/hyp-stub.S
@@ -16,31 +16,38 @@
 #include 
 #include 
 
+.macro invalid_vector  label
+SYM_CODE_START_LOCAL(\label)
+   .align 7
+   b   \label
+SYM_CODE_END(\label)
+.endm
+
.text
.pushsection.hyp.text, "ax"
 
.align 11
 
 SYM_CODE_START(__hyp_stub_vectors)
-   ventry  el2_sync_invalid// Synchronous EL2t
-   ventry  el2_irq_invalid // IRQ EL2t
-   ventry  el2_fiq_invalid // FIQ EL2t
-   ventry  el2_error_invalid   // Error EL2t
+   invalid_vector  hyp_stub_el2t_sync_invalid  // Synchronous EL2t
+   invalid_vector  hyp_stub_el2t_irq_invalid   // IRQ EL2t
+   invalid_vector  hyp_stub_el2t_fiq_invalid   // FIQ EL2t
+   invalid_vector  hyp_stub_el2t_error_invalid // Error EL2t
 
-   ventry  el2_sync_invalid// Synchronous EL2h
-   ventry  el2_irq_invalid // IRQ EL2h
-   ventry  el2_fiq_invalid // FIQ EL2h
-   ventry  el2_error_invalid   // Error EL2h
+   invalid_vector  hyp_stub_el2h_sync_invalid  // Synchronous EL2h
+   invalid_vector  hyp_stub_el2h_irq_invalid   // IRQ EL2h
+   invalid_vector  hyp_stub_el2h_fiq_invalid   // FIQ EL2h
+   invalid_vector  hyp_stub_el2h_error_invalid // Error EL2h
 
ventry  el1_sync// Synchronous 64-bit EL1
-   ventry  el1_irq_invalid // IRQ 64-bit EL1
-   ventry  el1_fiq_invalid // FIQ 64-bit EL1
-   ventry  el1_error_invalid   // Error 64-bit EL1
-
-   ventry  el1_sync_invalid// Synchronous 32-bit EL1
-   ventry  el1_irq_invalid // IRQ 32-bit EL1
-   ventry  el1_fiq_invalid // FIQ 32-bit EL1
-   ventry  el1_error_invalid   // Error 32-bit EL1
+   invalid_vector  hyp_stub_el1_irq_invalid// IRQ 64-bit EL1
+   invalid_vector  hyp_stub_el1_fiq_invalid// FIQ 64-bit EL1
+   invalid_vector  hyp_stub_el1_error_invalid  // Error 64-bit EL1
+
+   invalid_vector  hyp_stub_32b_el1_sync_invalid   // Synchronous 32-bit 
EL1
+   invalid_vector  hyp_stub_32b_el1_irq_invalid// IRQ 32-bit EL1
+   invalid_vector  hyp_stub_32b_el1_fiq_invalid// FIQ 32-bit EL1
+   invalid_vector  hyp_stub_32b_el1_error_invalid  // Error 32-bit EL1
.align 11
 SYM_INNER_LABEL(__hyp_stub_vectors_end, SYM_L_LOCAL)
 SYM_CODE_END(__hyp_stub_vectors)
@@ -173,23 +180,6 @@ SYM_CODE_END(enter_vhe)
 
.popsection
 
-.macro invalid_vector  label
-SYM_CODE_START_LOCAL(\label)
-   b \label
-SYM_CODE_END(\label)
-.endm
-
-   invalid_vector  el2_sync_invalid
-   invalid_vector  el2_irq_invalid
-   invalid_vector  el2_fiq_invalid
-   invalid_vector  el2_error_invalid
-   invalid_vector  el1_sync_invalid
-   invalid_vector  el1_irq_invalid
-   invalid_vector  el1_fiq_invalid
-   invalid_vector  el1_error_invalid
-
-   .popsection
-
 /*
  * __hyp_set_vectors: Call this after boot to set the initial hypervisor
  * vectors as part of hypervisor installation.  On an SMP system, this should
-- 
2.25.1



[PATCH v13 01/18] arm64: hyp-stub: Check the size of the HYP stub's vectors

2021-04-07 Thread Pavel Tatashin
From: James Morse 

Hibernate contains a set of temporary EL2 vectors used to 'park'
EL2 somewhere safe while all the memory is thrown in the air.
Making kexec do its relocations with the MMU on means they have to
be done at EL1, so EL2 has to be parked. This means yet another
set of vectors.

All these things do is HVC_SET_VECTORS and HVC_SOFT_RESTART, both
of which are implemented by the hyp-stub. Let's copy it instead
of re-inventing it.

To do this the hyp-stub's entrails need to be packed neatly inside
its 2K vectors.

Start by moving the final 2K alignment inside the end marker, and
add a build check that we didn't overflow 2K.
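
The check relies on .org only being allowed to move the location counter
forwards. Roughly:

    size = __hyp_stub_vectors_end - __hyp_stub_vectors
    .org . - size + SZ_2K

    size == SZ_2K:  .org .           -> no-op, the build succeeds
    size >  SZ_2K:  .org . - excess  -> "attempt to move .org backwards",
                                         assembler error, the build fails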

Signed-off-by: James Morse 
Signed-off-by: Pavel Tatashin 
---
 arch/arm64/kernel/hyp-stub.S | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/hyp-stub.S b/arch/arm64/kernel/hyp-stub.S
index 5eccbd62fec8..572b28646005 100644
--- a/arch/arm64/kernel/hyp-stub.S
+++ b/arch/arm64/kernel/hyp-stub.S
@@ -41,9 +41,13 @@ SYM_CODE_START(__hyp_stub_vectors)
ventry  el1_irq_invalid // IRQ 32-bit EL1
ventry  el1_fiq_invalid // FIQ 32-bit EL1
ventry  el1_error_invalid   // Error 32-bit EL1
+   .align 11
+SYM_INNER_LABEL(__hyp_stub_vectors_end, SYM_L_LOCAL)
 SYM_CODE_END(__hyp_stub_vectors)
 
-   .align 11
+# Check the __hyp_stub_vectors didn't overflow
+.org . - (__hyp_stub_vectors_end - __hyp_stub_vectors) + SZ_2K
+
 
 SYM_CODE_START_LOCAL(el1_sync)
cmp x0, #HVC_SET_VECTORS
-- 
2.25.1



[PATCH v13 00/18] arm64: MMU enabled kexec relocation

2021-04-07 Thread Pavel Tatashin
by's
- Moved "add PUD_SECT_RDONLY" earlier in series to be with other
  clean-ups
- Added "Derived from:" to arch/arm64/mm/trans_pgd.c
- Removed "flags" from trans_info
- Changed .trans_alloc_page assumption to return zeroed page.
- Simplify changes to trans_pgd_map_page(), by keeping the old
  code.
- Simplify changes to trans_pgd_create_copy, by keeping the old
  code.
- Removed: "add trans_pgd_create_empty"
- replace init_mm with NULL, and keep using non "__" version of
  populate functions.
v3:
- Split changes to create_safe_exec_page() into several patches for
  easier review as request by Mark Rutland. This is why this series
  has 3 more patches.
- Renamed trans_table to tans_pgd as agreed with Mark. The header
  comment in trans_pgd.c explains that trans stands for
  transitional page tables. Meaning they are used in transition
  between two kernels.
v2:
- Fixed hibernate bug reported by James Morse
- Addressed comments from James Morse:
  * More incremental changes to trans_table
  * Removed TRANS_FORCEMAP
  * Added kexec reboot data for image with 380M in size.

Enable MMU during kexec relocation in order to improve reboot performance.

If kexec functionality is used for a fast system update with minimal
downtime, the relocation of kernel + initramfs takes a significant portion
of the reboot time.

The reason relocation is slow is that it is done with the MMU disabled, and
thus it does not benefit from the D-cache.

Performance data


Cavium ThunderX2:
Kernel Image size: 38M Iniramfs size: 46M Total relocation size: 84M
MMU-disabled:
relocation  7.489539915s
MMU-enabled:
relocation  0.03946095s

Relocation performance is improved 190 times.
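
In terms of throughput, that is roughly 84M / 7.49s ≈ 11 MB/s with the MMU
off versus 84M / 0.039s ≈ 2.1 GB/s with the MMU (and therefore the D-cache)
enabled, which is where the ~190x figure comes from.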

Broadcom Stingray:
For this experiment, the size of kernel plus initramfs is small, only 25M.
If the initramfs were larger, the improvement would be greater, as the time
spent in relocation is proportional to the size of the relocation.

MMU-disabled:
kernel shutdown 0.022131328s
relocation  0.440510736s
kernel startup  0.294706768s

Relocation was taking 58.2% of reboot time

MMU-enabled:
kernel shutdown 0.032066576s
relocation  0.022158152s
kernel startup  0.296055880s

Now: Relocation takes 6.3% of reboot time

Total reboot is 2.16 times faster.

With bigger userland (fitImage 380M), the reboot time is improved by 3.57s,
and is reduced from 3.9s down to 0.33s

Previous approaches and discussions
---
v12: 
https://lore.kernel.org/lkml/20210303002230.1083176-1-pasha.tatas...@soleen.com
v11: 
https://lore.kernel.org/lkml/20210127172706.617195-1-pasha.tatas...@soleen.com
v10: 
https://lore.kernel.org/linux-arm-kernel/20210125191923.1060122-1-pasha.tatas...@soleen.com
v9: 
https://lore.kernel.org/lkml/20200326032420.27220-1-pasha.tatas...@soleen.com
v8: 
https://lore.kernel.org/lkml/20191204155938.2279686-1-pasha.tatas...@soleen.com
v7: 
https://lore.kernel.org/lkml/20191016200034.1342308-1-pasha.tatas...@soleen.com
v6: 
https://lore.kernel.org/lkml/20191004185234.31471-1-pasha.tatas...@soleen.com
v5: 
https://lore.kernel.org/lkml/20190923203427.294286-1-pasha.tatas...@soleen.com
v4: 
https://lore.kernel.org/lkml/20190909181221.309510-1-pasha.tatas...@soleen.com
v3: 
https://lore.kernel.org/lkml/20190821183204.23576-1-pasha.tatas...@soleen.com
v2: 
https://lore.kernel.org/lkml/20190817024629.26611-1-pasha.tatas...@soleen.com
v1: 
https://lore.kernel.org/lkml/20190801152439.11363-1-pasha.tatas...@soleen.com

James Morse (4):
  arm64: hyp-stub: Check the size of the HYP stub's vectors
  arm64: hyp-stub: Move invalid vector entries into the vectors
  arm64: hyp-stub: Move el1_sync into the vectors
  arm64: kexec: Use dcache ops macros instead of open-coding

Pavel Tatashin (13):
  arm64: kernel: add helper for booted at EL2 and not VHE
  arm64: trans_pgd: hibernate: Add trans_pgd_copy_el2_vectors
  arm64: hibernate: abstract ttrb0 setup function
  arm64: kexec: flush image and lists during kexec load time
  arm64: kexec: skip relocation code for inplace kexec
  arm64: kexec: pass kimage as the only argument to relocation function
  arm64: kexec: kexec may require EL2 vectors
  arm64: kexec: relocate in EL1 mode
  arm64: kexec: use ld script for relocation function
  arm64: kexec: install a copy of the linear-map
  arm64: kexec: keep MMU enabled during kexec relocation
  arm64: kexec: remove the pre-kexec PoC maintenance
  arm64: kexec: Remove cpu-reset.h

Pingfan Liu (1):
  arm64/mm: remove useless trans_pgd_map_page()

 arch/arm64/Kconfig   |   2 +-
 arch/arm64/include/asm/assembler.h   |  31 -
 arch/arm64/include/asm/kexec.h   |  12 ++
 arch/arm64/include/asm/mmu_context.h |  24 
 arch/arm64/include/asm/sections.h|   1 +
 arch/arm64/include/asm/trans_pgd.h 

Re: [PATCH] mm/hugeltb: fix renaming of PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN

2021-04-01 Thread Pavel Tatashin
> > Andrew, since "mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN" is
> > not yet in the mainline, should I send a new version of this patch so
> > we won't have bisecting problems in the future?
>
> I've already added Mike's fix, as
> mm-cma-rename-pf_memalloc_nocma-to-pf_memalloc_pin-fix.patch.  It shall
> fold it into mm-cma-rename-pf_memalloc_nocma-to-pf_memalloc_pin.patch
> prior to upstreaming, so no bisection issue.

Great, thank you!

Pasha


Re: [PATCH] mm/hugeltb: fix renaming of PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN

2021-04-01 Thread Pavel Tatashin
On Wed, Mar 31, 2021 at 12:38 PM Mike Rapoport  wrote:
>
> From: Mike Rapoport 
>
> The renaming of PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN missed one occurrence
> in mm/hugetlb.c which causes build error:
>
>   CC  mm/hugetlb.o
> mm/hugetlb.c: In function ‘dequeue_huge_page_node_exact’:
> mm/hugetlb.c:1081:33: error: ‘PF_MEMALLOC_NOCMA’ undeclared (first use in 
> this function); did you mean ‘PF_MEMALLOC_NOFS’?
>   bool pin = !!(current->flags & PF_MEMALLOC_NOCMA);
>  ^
>  PF_MEMALLOC_NOFS
> mm/hugetlb.c:1081:33: note: each undeclared identifier is reported only once 
> for each function it appears in
> scripts/Makefile.build:273: recipe for target 'mm/hugetlb.o' failed
> make[2]: *** [mm/hugetlb.o] Error 1
>
> Signed-off-by: Mike Rapoport 
> ---
>  mm/hugetlb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a5236c2f7bb2..c22111f3da20 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1078,7 +1078,7 @@ static void enqueue_huge_page(struct hstate *h, struct 
> page *page)
>  static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
>  {
> struct page *page;
> -   bool pin = !!(current->flags & PF_MEMALLOC_NOCMA);
> +   bool pin = !!(current->flags & PF_MEMALLOC_PIN);

Thank you Mike!

Andrew, since "mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN" is
not yet in the mainline, should I send a new version of this patch so
we won't have bisecting problems in the future?

Thank you,
Pasha


[PATCH] [Backport for stable 5.11] arm64: mm: correct the inside linear map boundaries during hotplug check

2021-03-29 Thread Pavel Tatashin
commit ee7febce051945be28ad86d16a15886f878204de upstream.

Memory hotplug may fail on systems with CONFIG_RANDOMIZE_BASE because the
linear map range is not checked correctly.

The start physical address that linear map covers can be actually at the
end of the range because of randomization. Check that and if so reduce it
to 0.

This can be verified on QEMU with setting kaslr-seed to ~0ul:

memstart_offset_seed = 0x
START: __pa(_PAGE_OFFSET(vabits_actual)) = 9000c000
END:   __pa(PAGE_END - 1) =  1000bfff

Fixes: 58284a901b42 ("arm64/mm: Validate hotplug range before creating linear 
mapping")
Signed-off-by: Pavel Tatashin 
Tested-by: Tyler Hicks 
Reviewed-by: Anshuman Khandual 
---
 arch/arm64/mm/mmu.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 6f0648777d34..ee01f421e1e4 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1445,14 +1445,30 @@ static void __remove_pgd_mapping(pgd_t *pgdir, unsigned 
long start, u64 size)
 
 static bool inside_linear_region(u64 start, u64 size)
 {
+   u64 start_linear_pa = __pa(_PAGE_OFFSET(vabits_actual));
+   u64 end_linear_pa = __pa(PAGE_END - 1);
+
+   if (IS_ENABLED(CONFIG_RANDOMIZE_BASE)) {
+   /*
+* Check for a wrap, it is possible because of randomized linear
+* mapping the start physical address is actually bigger than
+* the end physical address. In this case set start to zero
+* because [0, end_linear_pa] range must still be able to cover
+* all addressable physical addresses.
+*/
+   if (start_linear_pa > end_linear_pa)
+   start_linear_pa = 0;
+   }
+
+   WARN_ON(start_linear_pa > end_linear_pa);
+
/*
 * Linear mapping region is the range [PAGE_OFFSET..(PAGE_END - 1)]
 * accommodating both its ends but excluding PAGE_END. Max physical
 * range which can be mapped inside this linear mapping range, must
 * also be derived from its end points.
 */
-   return start >= __pa(_PAGE_OFFSET(vabits_actual)) &&
-  (start + size - 1) <= __pa(PAGE_END - 1);
+   return start >= start_linear_pa && (start + size - 1) <= end_linear_pa;
 }
 
 int arch_add_memory(int nid, u64 start, u64 size,
-- 
2.25.1



Re: [PATCH 5.11 225/254] arm64/mm: define arch_get_mappable_range()

2021-03-29 Thread Pavel Tatashin
On Mon, Mar 29, 2021 at 9:51 AM Greg Kroah-Hartman
 wrote:
>
> On Mon, Mar 29, 2021 at 03:49:19PM +0200, Ard Biesheuvel wrote:
> > (+ Pavel)
> >
> > On Mon, 29 Mar 2021 at 15:42, Greg Kroah-Hartman
> >  wrote:
> > >
> > > On Mon, Mar 29, 2021 at 03:08:52PM +0200, Ard Biesheuvel wrote:
> > > > On Mon, 29 Mar 2021 at 12:12, Greg Kroah-Hartman
> > > >  wrote:
> > > > >
> > > > > On Mon, Mar 29, 2021 at 03:05:25PM +0530, Naresh Kamboju wrote:
> > > > > > On Mon, 29 Mar 2021 at 14:10, Greg Kroah-Hartman
> > > > > >  wrote:
> > > > > > >
> > > > > > > From: Anshuman Khandual 
> > > > > > >
> > > > > > > [ Upstream commit 03aaf83fba6e5af08b5dd174c72edee9b7d9ed9b ]
> > > > > > >
> > > > > > > This overrides arch_get_mappable_range() on arm64 platform which 
> > > > > > > will be
> > > > > > > used with recently added generic framework.  It drops
> > > > > > > inside_linear_region() and subsequent check in arch_add_memory() 
> > > > > > > which are
> > > > > > > no longer required.  It also adds a VM_BUG_ON() check that would 
> > > > > > > ensure
> > > > > > > that mhp_range_allowed() has already been called.
> > > > > > >
> > > > > > > Link: 
> > > > > > > https://lkml.kernel.org/r/1612149902-7867-3-git-send-email-anshuman.khand...@arm.com
> > > > > > > Signed-off-by: Anshuman Khandual 
> > > > > > > Reviewed-by: David Hildenbrand 
> > > > > > > Reviewed-by: Catalin Marinas 
> > > > > > > Cc: Will Deacon 
> > > > > > > Cc: Ard Biesheuvel 
> > > > > > > Cc: Mark Rutland 
> > > > > > > Cc: Heiko Carstens 
> > > > > > > Cc: Jason Wang 
> > > > > > > Cc: Jonathan Cameron 
> > > > > > > Cc: "Michael S. Tsirkin" 
> > > > > > > Cc: Michal Hocko 
> > > > > > > Cc: Oscar Salvador 
> > > > > > > Cc: Pankaj Gupta 
> > > > > > > Cc: Pankaj Gupta 
> > > > > > > Cc: teawater 
> > > > > > > Cc: Vasily Gorbik 
> > > > > > > Cc: Wei Yang 
> > > > > > > Signed-off-by: Andrew Morton 
> > > > > > > Signed-off-by: Linus Torvalds 
> > > > > > > Signed-off-by: Sasha Levin 
> > > > > > > ---
> > > > > > >  arch/arm64/mm/mmu.c | 15 +++
> > > > > > >  1 file changed, 7 insertions(+), 8 deletions(-)
> > > > > > >
> > > > > > > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > > > > > > index 6f0648777d34..92b3be127796 100644
> > > > > > > --- a/arch/arm64/mm/mmu.c
> > > > > > > +++ b/arch/arm64/mm/mmu.c
> > > > > > > @@ -1443,16 +1443,19 @@ static void __remove_pgd_mapping(pgd_t 
> > > > > > > *pgdir, unsigned long start, u64 size)
> > > > > > > free_empty_tables(start, end, PAGE_OFFSET, PAGE_END);
> > > > > > >  }
> > > > > > >
> > > > > > > -static bool inside_linear_region(u64 start, u64 size)
> > > > > > > +struct range arch_get_mappable_range(void)
> > > > > > >  {
> > > > > > > +   struct range mhp_range;
> > > > > > > +
> > > > > > > /*
> > > > > > >  * Linear mapping region is the range 
> > > > > > > [PAGE_OFFSET..(PAGE_END - 1)]
> > > > > > >  * accommodating both its ends but excluding PAGE_END. 
> > > > > > > Max physical
> > > > > > >  * range which can be mapped inside this linear mapping 
> > > > > > > range, must
> > > > > > >  * also be derived from its end points.
> > > > > > >  */
> > > > > > > -   return start >= __pa(_PAGE_OFFSET(vabits_actual)) &&
> > > > > > > -  (start + size - 1) <= __pa(PAGE_END - 1);
> > > > > > > +   mhp_range.start = __pa(_PAGE_OFFSET(vabits_actual));
> > > > > > > +   mhp_range.end =  __pa(PAGE_END - 1);
> > > > > > > +   return mhp_range;
> > > > > > >  }
> > > > > > >
> > > > > > >  int arch_add_memory(int nid, u64 start, u64 size,
> > > > > > > @@ -1460,11 +1463,7 @@ int arch_add_memory(int nid, u64 start, 
> > > > > > > u64 size,
> > > > > > >  {
> > > > > > > int ret, flags = 0;
> > > > > > >
> > > > > > > -   if (!inside_linear_region(start, size)) {
> > > > > > > -   pr_err("[%llx %llx] is outside linear mapping 
> > > > > > > region\n", start, start + size);
> > > > > > > -   return -EINVAL;
> > > > > > > -   }
> > > > > > > -
> > > > > > > +   VM_BUG_ON(!mhp_range_allowed(start, size, true));
> > > > > > > if (rodata_full || debug_pagealloc_enabled())
> > > > > > > flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> > > > > >
> > > > > > The stable rc 5.10 and 5.11 builds failed for arm64 architecture
> > > > > > due to below warnings / errors,
> > > > > >
> > > > > > > Anshuman Khandual 
> > > > > > > arm64/mm: define arch_get_mappable_range()
> > > > > >
> > > > > >
> > > > > >   arch/arm64/mm/mmu.c: In function 'arch_add_memory':
> > > > > >   arch/arm64/mm/mmu.c:1483:13: error: implicit declaration of 
> > > > > > function
> > > > > > 'mhp_range_allowed'; did you mean 'cpu_map_prog_allowed'?
> > > > > > [-Werror=implicit-function-declaration]
> > > > > > VM_BUG_ON(!mhp_range_allowed(start, size, true));
> > > > > >^
> > > > > >   include/linux/build_bug.h:30:63: note: in definition of 

Re: [PATCH] libnvdimm/region: Allow setting align attribute on regions without mappings

2021-03-26 Thread Pavel Tatashin
On Fri, Mar 26, 2021 at 11:27 AM Tyler Hicks
 wrote:
>
> The alignment constraint for namespace creation in a region was
> increased, from 2M to 16M, for non-PowerPC architectures in v5.7 with
> commit 2522afb86a8c ("libnvdimm/region: Introduce an 'align'
> attribute"). The thought behind the change was that region alignment
> should be uniform across all architectures and, since PowerPC had the
> largest alignment constraint of 16M, all architectures should conform to
> that alignment.
>
> The change regressed namespace creation in pre-defined regions that
> relied on 2M alignment but a workaround was provided in the form of a
> sysfs attribute, named 'align', that could be adjusted to a non-default
> alignment value.
>
> However, the sysfs attribute's store function returned an error (-ENXIO)
> when userspace attempted to change the alignment of a region that had no
> mappings. This affected 2M aligned regions of volatile memory that were
> defined in a device tree using "pmem-region" and created by the
> of_pmem_region_driver, since those regions do not contain mappings
> (ndr_mappings is 0).
>
> Allow userspace to set the align attribute on pre-existing regions that
> do not have mappings so that namespaces can still be within those
> regions, despite not being aligned to 16M.
>
> Fixes: 2522afb86a8c ("libnvdimm/region: Introduce an 'align' attribute")
> Signed-off-by: Tyler Hicks 

This solves the problem that I had in this thread:
https://lore.kernel.org/lkml/ca+ck2bcd13jblmxn2mauryvqgkbs5ic2uqyssxxtccszxcm...@mail.gmail.com/

Thank you Tyler for root causing and finding a proper fix.

Reviewed-by: Pavel Tatashin 

> ---
>  drivers/nvdimm/region_devs.c | 33 ++---
>  1 file changed, 18 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index ef23119db574..09cff8aa6b40 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -545,29 +545,32 @@ static ssize_t align_store(struct device *dev,
> struct device_attribute *attr, const char *buf, size_t len)
>  {
> struct nd_region *nd_region = to_nd_region(dev);
> -   unsigned long val, dpa;
> -   u32 remainder;
> +   unsigned long val;
> int rc;
>
> rc = kstrtoul(buf, 0, &val);
> if (rc)
> return rc;
>
> -   if (!nd_region->ndr_mappings)
> -   return -ENXIO;
> -
> -   /*
> -* Ensure space-align is evenly divisible by the region
> -* interleave-width because the kernel typically has no facility
> -* to determine which DIMM(s), dimm-physical-addresses, would
> -* contribute to the tail capacity in system-physical-address
> -* space for the namespace.
> -*/
> -   dpa = div_u64_rem(val, nd_region->ndr_mappings, &remainder);
> -   if (!is_power_of_2(dpa) || dpa < PAGE_SIZE
> -   || val > region_size(nd_region) || remainder)
> +   if (val > region_size(nd_region))
> return -EINVAL;
>
> +   if (nd_region->ndr_mappings) {
> +   unsigned long dpa;
> +   u32 remainder;
> +
> +   /*
> +* Ensure space-align is evenly divisible by the region
> +* interleave-width because the kernel typically has no 
> facility
> +* to determine which DIMM(s), dimm-physical-addresses, would
> +* contribute to the tail capacity in system-physical-address
> +* space for the namespace.
> +*/
> +   dpa = div_u64_rem(val, nd_region->ndr_mappings, &remainder);
> +   if (!is_power_of_2(dpa) || dpa < PAGE_SIZE || remainder)
> +   return -EINVAL;
> +   }
> +
> /*
>  * Given that space allocation consults this value multiple
>  * times ensure it does not change for the duration of the
> --
> 2.25.1
>


Re: [PATCH v2] loop: call __loop_clr_fd() with lo_mutex locked to avoid autoclear race

2021-03-26 Thread Pavel Tatashin
On Fri, Mar 26, 2021 at 5:00 AM  wrote:
>
> From: Zqiang 
>
> lo->lo_refcnt = 0
>
> CPU0                                     CPU1
> lo_open()                                lo_open()
>  mutex_lock(&lo->lo_mutex)
>  atomic_inc(&lo->lo_refcnt)
>  lo_refcnt == 1
>  mutex_unlock(&lo->lo_mutex)
>                                           mutex_lock(&lo->lo_mutex)
>                                           atomic_inc(&lo->lo_refcnt)
>                                           lo_refcnt == 2
>                                           mutex_unlock(&lo->lo_mutex)
> loop_clr_fd()
>  mutex_lock(&lo->lo_mutex)
>  atomic_read(&lo->lo_refcnt) > 1
>  lo->lo_flags |= LO_FLAGS_AUTOCLEAR       lo_release()
>  mutex_unlock(&lo->lo_mutex)
>  return                                   mutex_lock(&lo->lo_mutex)
>                                           atomic_dec_return(&lo->lo_refcnt)
>                                           lo_refcnt == 1
>                                           mutex_unlock(&lo->lo_mutex)
>                                           return
>
> lo_release()
>  mutex_lock(&lo->lo_mutex)
>  atomic_dec_return(&lo->lo_refcnt)
>  lo_refcnt == 0
>  lo->lo_flags & LO_FLAGS_AUTOCLEAR
>   == true
>  mutex_unlock(&lo->lo_mutex)              loop_control_ioctl()
>                                            case LOOP_CTL_REMOVE:
>                                             mutex_lock(&lo->lo_mutex)
>                                             atomic_read(&lo->lo_refcnt) == 0
>   __loop_clr_fd(lo, true)                  mutex_unlock(&lo->lo_mutex)
>    mutex_lock(&lo->lo_mutex)               loop_remove(lo)
>                                             mutex_destroy(&lo->lo_mutex)
>   ..                                        kfree(lo)
>    data race
>
> When different tasks on two CPUs perform the above operations on the same
> lo device, a data race may occur. Do not drop lo->lo_mutex before calling
> __loop_clr_fd(), so that the refcnt and LO_FLAGS_AUTOCLEAR checks in
> lo_release() stay in sync.

There is a race with autoclear logic where use after free may occur as
shown in the above scenario. Do not drop lo->lo_mutex before calling
__loop_clr_fd(), so refcnt and LO_FLAGS_AUTOCLEAR check in lo_release
stay in sync.

Reviewed-by: Pavel Tatashin 

>
> Fixes: 6cc8e7430801 ("loop: scale loop device by introducing per device lock")
> Signed-off-by: Zqiang 
> ---
>  v1->v2:
>  Modify the title and commit message.
>
>  drivers/block/loop.c | 11 ---
>  1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index d58d68f3c7cd..5712f1698a66 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1201,7 +1201,6 @@ static int __loop_clr_fd(struct loop_device *lo, bool 
> release)
> bool partscan = false;
> int lo_number;
>
> -   mutex_lock(&lo->lo_mutex);
> if (WARN_ON_ONCE(lo->lo_state != Lo_rundown)) {
> err = -ENXIO;
> goto out_unlock;
> @@ -1257,7 +1256,6 @@ static int __loop_clr_fd(struct loop_device *lo, bool 
> release)
> lo_number = lo->lo_number;
> loop_unprepare_queue(lo);
>  out_unlock:
> -   mutex_unlock(&lo->lo_mutex);
> if (partscan) {
> /*
>  * bd_mutex has been held already in release path, so don't
> @@ -1288,12 +1286,11 @@ static int __loop_clr_fd(struct loop_device *lo, bool 
> release)
>  * protects us from all the other places trying to change the 'lo'
>  * device.
>  */
> -   mutex_lock(&lo->lo_mutex);
> +
> lo->lo_flags = 0;
> if (!part_shift)
> lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
> lo->lo_state = Lo_unbound;
> -   mutex_unlock(&lo->lo_mutex);
>
> /*
>  * Need not hold lo_mutex to fput backing file. Calling fput holding
> @@ -1332,9 +1329,10 @@ static int loop_clr_fd(struct loop_device *lo)
> return 0;
> }
> lo->lo_state = Lo_rundown;
> +   err = __loop_clr_fd(lo, false);
> mutex_unlock(&lo->lo_mutex);
>
> -   return __loop_clr_fd(lo, false);
> +   return err;
>  }
>
>  static int
> @@ -1916,13 +1914,12 @@ static void lo_release(struct gendisk *disk, fmode_t 
> mode)
> if (lo->lo_state != Lo_bound)
> goto out_unlock;
> lo->lo_state = Lo_rundown;
> -   mutex_unlock(&lo->lo_mutex);
> /*
>  * In autoclear mode, stop the loop thread
>  * and remove configuration after last close.
>  */
> __loop_clr_fd(lo, true);
> -   return;
> +   goto out_unlock;
> } else if (lo->lo_state == Lo_bound) {
> /*
>  * Otherwise keep thread (if running) and config,
> --
> 2.17.1
>


Re: [PATCH] loop: Fix use of unsafe lo->lo_mutex locks

2021-03-25 Thread Pavel Tatashin
Hi Qiang,

Thank you for root-causing this issue. Did you encounter this issue, or did
you find it by inspection?

I would change the title to reflect what is actually being changed, something like:

loop: call __loop_clr_fd() with lo_mutex locked to avoid autoclear race


>   ..                                        kfree(lo)
>    UAF
>
> When different tasks on two CPUs perform the above operations on the same
> lo device, UAF may occur.

Please also explain the fix:

Do not drop lo->lo_mutex before calling __loop_clr_fd(), so refcnt and
LO_FLAGS_AUTOCLEAR check in lo_release stay in sync.

>
> Fixes: 6cc8e7430801 ("loop: scale loop device by introducing per device lock")
> Signed-off-by: Zqiang 
> ---
>  drivers/block/loop.c | 11 ---
>  1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index d58d68f3c7cd..5712f1698a66 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1201,7 +1201,6 @@ static int __loop_clr_fd(struct loop_device *lo, bool 
> release)
> bool partscan = false;
> int lo_number;
>
> -   mutex_lock(&lo->lo_mutex);
> if (WARN_ON_ONCE(lo->lo_state != Lo_rundown)) {
> err = -ENXIO;
> goto out_unlock;
> @@ -1257,7 +1256,6 @@ static int __loop_clr_fd(struct loop_device *lo, bool 
> release)
> lo_number = lo->lo_number;
> loop_unprepare_queue(lo);
>  out_unlock:
> -   mutex_unlock(&lo->lo_mutex);
> if (partscan) {
> /*
>  * bd_mutex has been held already in release path, so don't
> @@ -1288,12 +1286,11 @@ static int __loop_clr_fd(struct loop_device *lo, bool 
> release)
>  * protects us from all the other places trying to change the 'lo'
>  * device.
>  */
> -   mutex_lock(&lo->lo_mutex);
> +
> lo->lo_flags = 0;
> if (!part_shift)
> lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
> lo->lo_state = Lo_unbound;
> -   mutex_unlock(&lo->lo_mutex);
>
> /*
>  * Need not hold lo_mutex to fput backing file. Calling fput holding
> @@ -1332,9 +1329,10 @@ static int loop_clr_fd(struct loop_device *lo)
> return 0;
> }
> lo->lo_state = Lo_rundown;
> +   err = __loop_clr_fd(lo, false);
> mutex_unlock(&lo->lo_mutex);
>
> -   return __loop_clr_fd(lo, false);
> +   return err;
>  }
>
>  static int
> @@ -1916,13 +1914,12 @@ static void lo_release(struct gendisk *disk, fmode_t 
> mode)
> if (lo->lo_state != Lo_bound)
> goto out_unlock;
> lo->lo_state = Lo_rundown;
> -   mutex_unlock(&lo->lo_mutex);
> /*
>  * In autoclear mode, stop the loop thread
>  * and remove configuration after last close.
>  */
>     __loop_clr_fd(lo, true);
> -   return;
> +   goto out_unlock;
> } else if (lo->lo_state == Lo_bound) {
> /*
>  * Otherwise keep thread (if running) and config,
> --
> 2.17.1
>

LGTM
Reviewed-by: Pavel Tatashin 

Thank you,
Pasha


[PATCH] arm64: kdump: update ppos when reading elfcorehdr

2021-03-19 Thread Pavel Tatashin
The ppos points to a position in the old kernel memory (and in case of
arm64 in the crash kernel since elfcorehdr is passed as a segment). The
function should advance ppos by the amount that was read. This bug is
currently not exposed, but only by accident; other platforms update this
value properly. So, fix it in the arm64 version of elfcorehdr_read() as well.
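
A simplified sketch of the kind of caller that breaks without this (a
hypothetical helper, not the actual fs/proc/vmcore.c code): the same ppos
is passed back in on every iteration, so if the arch hook never advances
it, each chunk returns the first bytes of the header again.

	/* Illustration only -- hypothetical caller, not kernel code. */
	static ssize_t read_whole_elfcorehdr(char *buf, size_t total, u64 start)
	{
		u64 pos = start;	/* physical address of the header */
		size_t done = 0;

		while (done < total) {
			size_t chunk = min(total - done, (size_t)PAGE_SIZE);
			ssize_t n = elfcorehdr_read(buf + done, chunk, &pos);

			if (n <= 0)
				return -EIO;
			/*
			 * done advances, but without the *ppos update every
			 * chunk would contain the same leading header bytes.
			 */
			done += n;
		}
		return done;
	}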

Signed-off-by: Pavel Tatashin 
---
 arch/arm64/kernel/crash_dump.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/kernel/crash_dump.c b/arch/arm64/kernel/crash_dump.c
index e6e284265f19..58303a9ec32c 100644
--- a/arch/arm64/kernel/crash_dump.c
+++ b/arch/arm64/kernel/crash_dump.c
@@ -64,5 +64,7 @@ ssize_t copy_oldmem_page(unsigned long pfn, char *buf,
 ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
 {
memcpy(buf, phys_to_virt((phys_addr_t)*ppos), count);
+   *ppos += count;
+
return count;
 }
-- 
2.25.1



[PATCH v3 1/1] kexec: dump kmessage before machine_kexec

2021-03-19 Thread Pavel Tatashin
kmsg_dump(KMSG_DUMP_SHUTDOWN) is called before machine_restart(),
machine_halt(), and machine_power_off(); the only one that is missing is
machine_kexec().

The dmesg output that it contains can be used to study the shutdown
performance of both kernel and systemd during kexec reboot.
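
For context, pstore/ramoops picks this up through the kmsg_dump interface;
a dumper that reacts to the new notification has roughly the following
shape (sketch only, the function and variable names are illustrative):

	#include <linux/kmsg_dump.h>

	/* Sketch only: invoked by the kmsg_dump(KMSG_DUMP_SHUTDOWN) call. */
	static void shutdown_kmsg_dump(struct kmsg_dumper *dumper,
				       enum kmsg_dump_reason reason)
	{
		if (reason != KMSG_DUMP_SHUTDOWN)
			return;
		/*
		 * Walk the kernel log with the kmsg_dump_get_*() helpers and
		 * write it to persistent storage; this is what ramoops does.
		 */
	}

	static struct kmsg_dumper shutdown_dumper = {
		.dump = shutdown_kmsg_dump,
	};

	/* Registered once at init time: kmsg_dump_register(&shutdown_dumper); */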

Here is example of dmesg data collected after kexec:

root@dplat-cp22:~# cat /sys/fs/pstore/dmesg-ramoops-0 | tail
...
<6>[   70.914592] psci: CPU3 killed (polled 0 ms)
<5>[   70.915705] CPU4: shutdown
<6>[   70.916643] psci: CPU4 killed (polled 4 ms)
<5>[   70.917715] CPU5: shutdown
<6>[   70.918725] psci: CPU5 killed (polled 0 ms)
<5>[   70.919704] CPU6: shutdown
<6>[   70.920726] psci: CPU6 killed (polled 4 ms)
<5>[   70.921642] CPU7: shutdown
<6>[   70.922650] psci: CPU7 killed (polled 0 ms)

Signed-off-by: Pavel Tatashin 
Reviewed-by: Kees Cook 
Reviewed-by: Petr Mladek 
Reviewed-by: Bhupesh Sharma 
Acked-by: Baoquan He 
---
 kernel/kexec_core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index a0b6780740c8..6ee4a1cf6e8e 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1179,6 +1180,7 @@ int kernel_kexec(void)
machine_shutdown();
}
 
+   kmsg_dump(KMSG_DUMP_SHUTDOWN);
machine_kexec(kexec_image);
 
 #ifdef CONFIG_KEXEC_JUMP
-- 
2.25.1



[PATCH v3 0/1] dump kmessage before machine_kexec

2021-03-19 Thread Pavel Tatashin
Changelog
v3
- Re-sending because it still has not landed in mainline.
- Sync with mainline
- Added Acked-by: Baoquan He
v2
- Added review-by's
- Sync with mainline

Allow studying the shutdown performance of kexec reboots by having the kmsg
log saved via pstore.

Previous submissions
v1 https://lore.kernel.org/lkml/20200605194642.62278-1-pasha.tatas...@soleen.com
v2 
https://lore.kernel.org/lkml/20210126204125.313820-1-pasha.tatas...@soleen.com

Pavel Tatashin (1):
  kexec: dump kmessage before machine_kexec

 kernel/kexec_core.c | 2 ++
 1 file changed, 2 insertions(+)

-- 
2.25.1



Re: [PATCH v3 1/1] arm64: mm: correct the inside linear map range during hotplug check

2021-03-19 Thread Pavel Tatashin
Hi Will,

Could you please take this patch now that the dependencies landed in
the mainline?

Thank you,
Pasha

On Mon, Feb 22, 2021 at 9:17 AM Pavel Tatashin
 wrote:
>
> > Taking that won't help either though, because it will just explode when
> > it meets 'mm' in Linus's tree.
> >
> > So here's what I think we need to do:
> >
> >   - I'll apply your v3 at -rc1
> >   - You can send backports based on your -v2 for stable once the v3 has
> > been merged upstream.
> >
> > Sound good?
>
> Sounds good, I will send backport once v3 lands in Linus's tree.
>
> Thanks,
> Pasha
>
> >
> > Will


[PATCH v12 14/17] arm64: kexec: install a copy of the linear-map

2021-03-03 Thread Pavel Tatashin
To perform the kexec relocations with the MMU enabled, we need a copy
of the linear map.

Create one, and install it from the relocation code. This has to be done
from the assembly code as it will be idmapped with TTBR0. The kernel
runs in TTRB1, so can't use the break-before-make sequence on the mapping
it is executing from.

The makes no difference yet as the relocation code runs with the MMU
disabled.
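
On the C side, the copy itself is built with the trans_pgd API introduced
earlier in this series; a minimal sketch (kexec_copy_linear_map() is a
hypothetical helper name, error handling trimmed to the essentials):

	/* Sketch only: build a copy of the linear map for the relocation code. */
	static int kexec_copy_linear_map(struct trans_pgd_info *info,
					 struct kimage *kimage)
	{
		pgd_t *trans_pgd;
		int rc;

		rc = trans_pgd_create_copy(info, &trans_pgd, PAGE_OFFSET, PAGE_END);
		if (rc)
			return rc;

		kimage->arch.ttbr1 = __pa(trans_pgd);
		kimage->arch.zero_page = __pa_symbol(empty_zero_page);
		return 0;
	}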

Co-developed-by: James Morse 
Signed-off-by: Pavel Tatashin 
---
 arch/arm64/include/asm/assembler.h  | 19 +++
 arch/arm64/include/asm/kexec.h  |  2 ++
 arch/arm64/kernel/asm-offsets.c |  2 ++
 arch/arm64/kernel/hibernate-asm.S   | 20 
 arch/arm64/kernel/machine_kexec.c   | 16 ++--
 arch/arm64/kernel/relocate_kernel.S |  3 +++
 6 files changed, 40 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/include/asm/assembler.h 
b/arch/arm64/include/asm/assembler.h
index 29061b76aab6..3ce8131ad660 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -425,6 +425,25 @@ USER(\label, icivau, \tmp2)// 
invalidate I line PoU
isb
.endm
 
+/*
+ * To prevent the possibility of old and new partial table walks being visible
+ * in the tlb, switch the ttbr to a zero page when we invalidate the old
+ * records. D4.7.1 'General TLB maintenance requirements' in ARM DDI 0487A.i
+ * Even switching to our copied tables will cause a changed output address at
+ * each stage of the walk.
+ */
+   .macro break_before_make_ttbr_switch zero_page, page_table, tmp, tmp2
+   phys_to_ttbr \tmp, \zero_page
+   msr ttbr1_el1, \tmp
+   isb
+   tlbivmalle1
+   dsb nsh
+   phys_to_ttbr \tmp, \page_table
+   offset_ttbr1 \tmp, \tmp2
+   msr ttbr1_el1, \tmp
+   isb
+   .endm
+
 /*
  * reset_pmuserenr_el0 - reset PMUSERENR_EL0 if PMUv3 present
  */
diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 305cf0840ed3..59ac166daf53 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -97,6 +97,8 @@ struct kimage_arch {
phys_addr_t dtb_mem;
phys_addr_t kern_reloc;
phys_addr_t el2_vectors;
+   phys_addr_t ttbr1;
+   phys_addr_t zero_page;
/* Core ELF header buffer */
void *elf_headers;
unsigned long elf_headers_mem;
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 2e3278df1fc3..609362b5aa76 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -158,6 +158,8 @@ int main(void)
 #ifdef CONFIG_KEXEC_CORE
   DEFINE(KIMAGE_ARCH_DTB_MEM,  offsetof(struct kimage, arch.dtb_mem));
   DEFINE(KIMAGE_ARCH_EL2_VECTORS,  offsetof(struct kimage, 
arch.el2_vectors));
+  DEFINE(KIMAGE_ARCH_ZERO_PAGE,offsetof(struct kimage, 
arch.zero_page));
+  DEFINE(KIMAGE_ARCH_TTBR1,offsetof(struct kimage, arch.ttbr1));
   DEFINE(KIMAGE_HEAD,  offsetof(struct kimage, head));
   DEFINE(KIMAGE_START, offsetof(struct kimage, start));
   BLANK();
diff --git a/arch/arm64/kernel/hibernate-asm.S 
b/arch/arm64/kernel/hibernate-asm.S
index 8ccca660034e..a31e621ba867 100644
--- a/arch/arm64/kernel/hibernate-asm.S
+++ b/arch/arm64/kernel/hibernate-asm.S
@@ -15,26 +15,6 @@
 #include 
 #include 
 
-/*
- * To prevent the possibility of old and new partial table walks being visible
- * in the tlb, switch the ttbr to a zero page when we invalidate the old
- * records. D4.7.1 'General TLB maintenance requirements' in ARM DDI 0487A.i
- * Even switching to our copied tables will cause a changed output address at
- * each stage of the walk.
- */
-.macro break_before_make_ttbr_switch zero_page, page_table, tmp, tmp2
-   phys_to_ttbr \tmp, \zero_page
-   msr ttbr1_el1, \tmp
-   isb
-   tlbivmalle1
-   dsb nsh
-   phys_to_ttbr \tmp, \page_table
-   offset_ttbr1 \tmp, \tmp2
-   msr ttbr1_el1, \tmp
-   isb
-.endm
-
-
 /*
  * Resume from hibernate
  *
diff --git a/arch/arm64/kernel/machine_kexec.c 
b/arch/arm64/kernel/machine_kexec.c
index f1451d807708..c875ef522e53 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -153,6 +153,8 @@ static void *kexec_page_alloc(void *arg)
 
 int machine_kexec_post_load(struct kimage *kimage)
 {
+   int rc;
+   pgd_t *trans_pgd;
void *reloc_code = page_to_virt(kimage->control_code_page);
long reloc_size;
struct trans_pgd_info info = {
@@ -169,12 +171,22 @@ int machine_kexec_post_load(struct kimage *kimage)
 
kimage->arch.el2_vectors = 0;
if (is_hyp_callable()) {
-   int rc = trans_pgd_copy_el2_vectors(&info,
-   &kimage->arch.el2_vectors);
+   rc = trans_pgd_copy_el

[PATCH v12 15/17] arm64: kexec: keep MMU enabled during kexec relocation

2021-03-03 Thread Pavel Tatashin
Now that we have linear map page tables configured, keep the MMU enabled
to allow faster relocation of segments to their final destination.

The performance data: for a moderate size kernel + initramfs (25M), the
relocation was taking 0.382s; with the MMU enabled it now takes only
0.019s, a ~20x improvement.

The time is proportional to the size of the relocation, therefore if the
initramfs is larger (100M) it could take over a second.
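
(For scale: scaling linearly, a 100M relocation would take roughly
4 x 0.382s ≈ 1.5s with the MMU disabled, versus roughly 4 x 0.019s ≈ 0.08s
with it enabled.)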

Signed-off-by: Pavel Tatashin 
---
 arch/arm64/include/asm/kexec.h  |  3 +++
 arch/arm64/kernel/asm-offsets.c |  1 +
 arch/arm64/kernel/machine_kexec.c   | 16 ++
 arch/arm64/kernel/relocate_kernel.S | 33 +++--
 4 files changed, 38 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 59ac166daf53..5fc87b51f8a9 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -97,8 +97,11 @@ struct kimage_arch {
phys_addr_t dtb_mem;
phys_addr_t kern_reloc;
phys_addr_t el2_vectors;
+   phys_addr_t ttbr0;
phys_addr_t ttbr1;
phys_addr_t zero_page;
+   unsigned long phys_offset;
+   unsigned long t0sz;
/* Core ELF header buffer */
void *elf_headers;
unsigned long elf_headers_mem;
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 609362b5aa76..ec7bb80aedc8 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -159,6 +159,7 @@ int main(void)
   DEFINE(KIMAGE_ARCH_DTB_MEM,  offsetof(struct kimage, arch.dtb_mem));
   DEFINE(KIMAGE_ARCH_EL2_VECTORS,  offsetof(struct kimage, 
arch.el2_vectors));
   DEFINE(KIMAGE_ARCH_ZERO_PAGE,offsetof(struct kimage, 
arch.zero_page));
+  DEFINE(KIMAGE_ARCH_PHYS_OFFSET,  offsetof(struct kimage, 
arch.phys_offset));
   DEFINE(KIMAGE_ARCH_TTBR1,offsetof(struct kimage, arch.ttbr1));
   DEFINE(KIMAGE_HEAD,  offsetof(struct kimage, head));
   DEFINE(KIMAGE_START, offsetof(struct kimage, start));
diff --git a/arch/arm64/kernel/machine_kexec.c 
b/arch/arm64/kernel/machine_kexec.c
index c875ef522e53..d5c8aefc66f3 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -190,6 +190,11 @@ int machine_kexec_post_load(struct kimage *kimage)
reloc_size = __relocate_new_kernel_end - __relocate_new_kernel_start;
memcpy(reloc_code, __relocate_new_kernel_start, reloc_size);
kimage->arch.kern_reloc = __pa(reloc_code);
+   rc = trans_pgd_idmap_page(&info, &kimage->arch.ttbr0,
+ &kimage->arch.t0sz, reloc_code);
+   if (rc)
+   return rc;
+   kimage->arch.phys_offset = virt_to_phys(kimage) - (long)kimage;
 
/* Flush the reloc_code in preparation for its execution. */
__flush_dcache_area(reloc_code, reloc_size);
@@ -223,9 +228,9 @@ void machine_kexec(struct kimage *kimage)
local_daif_mask();
 
/*
-* Both restart and cpu_soft_restart will shutdown the MMU, disable data
+* Both restart and kernel_reloc will shutdown the MMU, disable data
 * caches. However, restart will start new kernel or purgatory directly,
-* cpu_soft_restart will transfer control to arm64_relocate_new_kernel
+* kernel_reloc contains the body of arm64_relocate_new_kernel
 * In kexec case, kimage->start points to purgatory assuming that
 * kernel entry and dtb address are embedded in purgatory by
 * userspace (kexec-tools).
@@ -239,10 +244,13 @@ void machine_kexec(struct kimage *kimage)
restart(is_hyp_callable(), kimage->start, kimage->arch.dtb_mem,
0, 0);
} else {
+   void (*kernel_reloc)(struct kimage *kimage);
+
if (is_hyp_callable())
__hyp_set_vectors(kimage->arch.el2_vectors);
-   cpu_soft_restart(kimage->arch.kern_reloc,
-virt_to_phys(kimage), 0, 0);
+   cpu_install_ttbr0(kimage->arch.ttbr0, kimage->arch.t0sz);
+   kernel_reloc = (void *)kimage->arch.kern_reloc;
+   kernel_reloc(kimage);
}
 
BUG(); /* Should never get here. */
diff --git a/arch/arm64/kernel/relocate_kernel.S 
b/arch/arm64/kernel/relocate_kernel.S
index e83b6380907d..8ac4b2d7f5e8 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -4,6 +4,8 @@
  *
  * Copyright (C) Linaro.
  * Copyright (C) Huawei Futurewei Technologies.
+ * Copyright (C) 2020, Microsoft Corporation.
+ * Pavel Tatashin 
  */
 
 #include 
@@ -15,6 +17,15 @@
 #include 
 #include 
 
+.macro turn_off_mmu tmp1, tmp2
+   mrs \tmp1, sctlr_el1
+   mov_q   \tmp2, SCTLR_ELx_FLAGS
+   bic \tmp1, \tmp1, \tmp2
+   pre_disable_mmu_workaround
+   msr sctlr_el1, \tmp

[PATCH v12 10/17] arm64: kexec: pass kimage as the only argument to relocation function

2021-03-03 Thread Pavel Tatashin
Currently, kexec relocation function (arm64_relocate_new_kernel) accepts
the following arguments:

head:   start of array that contains relocation information.
entry:  entry point for new kernel or purgatory.
dtb_mem:first and only argument to entry.

The number of arguments cannot be easily expanded, because this
function is also called from HVC_SOFT_RESTART, which preserves only
three arguments. Also, arm64_relocate_new_kernel is written in
assembly and is called without a stack, so there is no place to spill
extra arguments beyond the free registers.

Soon, we will need to pass more arguments: once we enable the MMU we
will need to pass information about the page tables.

Pass kimage to arm64_relocate_new_kernel, and teach it to get the
required fields from kimage.
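
The KIMAGE_* constants added to asm-offsets.c are what lets the assembly
find those fields: each DEFINE() becomes a #define in the generated
include/generated/asm-offsets.h, so the relocation code can load a field
with a plain immediate offset. Roughly:

    /* asm-offsets.c:     */  DEFINE(KIMAGE_HEAD, offsetof(struct kimage, head));
    /* generated header:  */  #define KIMAGE_HEAD <numeric offset of ->head>
    /* relocate_kernel.S: */  ldr x16, [x0, #KIMAGE_HEAD]   /* x0 = &kimage */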

Suggested-by: James Morse 
Signed-off-by: Pavel Tatashin 
---
 arch/arm64/kernel/asm-offsets.c |  7 +++
 arch/arm64/kernel/machine_kexec.c   |  6 --
 arch/arm64/kernel/relocate_kernel.S | 10 --
 3 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index a36e2fc330d4..0c92e193f866 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -9,6 +9,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -153,6 +154,12 @@ int main(void)
   DEFINE(PTRAUTH_USER_KEY_APGA,offsetof(struct 
ptrauth_keys_user, apga));
   DEFINE(PTRAUTH_KERNEL_KEY_APIA,  offsetof(struct ptrauth_keys_kernel, 
apia));
   BLANK();
+#endif
+#ifdef CONFIG_KEXEC_CORE
+  DEFINE(KIMAGE_ARCH_DTB_MEM,  offsetof(struct kimage, arch.dtb_mem));
+  DEFINE(KIMAGE_HEAD,  offsetof(struct kimage, head));
+  DEFINE(KIMAGE_START, offsetof(struct kimage, start));
+  BLANK();
 #endif
   return 0;
 }
diff --git a/arch/arm64/kernel/machine_kexec.c 
b/arch/arm64/kernel/machine_kexec.c
index b150b65f0b84..2e734e4ae12e 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -83,6 +83,8 @@ static void kexec_list_flush(struct kimage *kimage)
 {
kimage_entry_t *entry;
 
+   __flush_dcache_area(kimage, sizeof(*kimage));
+
for (entry = &kimage->head; ; entry++) {
unsigned int flag;
void *addr;
@@ -198,8 +200,8 @@ void machine_kexec(struct kimage *kimage)
restart(is_hyp_callable(), kimage->start, kimage->arch.dtb_mem,
0, 0);
} else {
-   cpu_soft_restart(kimage->arch.kern_reloc, kimage->head,
-kimage->start, kimage->arch.dtb_mem);
+   cpu_soft_restart(kimage->arch.kern_reloc, virt_to_phys(kimage),
+0, 0);
}
 
BUG(); /* Should never get here. */
diff --git a/arch/arm64/kernel/relocate_kernel.S 
b/arch/arm64/kernel/relocate_kernel.S
index 718037bef560..36b4496524c3 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -27,9 +27,7 @@
  */
 SYM_CODE_START(arm64_relocate_new_kernel)
/* Setup the list loop variables. */
-   mov x18, x2 /* x18 = dtb address */
-   mov x17, x1 /* x17 = kimage_start */
-   mov x16, x0 /* x16 = kimage_head */
+   ldr x16, [x0, #KIMAGE_HEAD] /* x16 = kimage_head */
mov x14, xzr/* x14 = entry ptr */
mov x13, xzr/* x13 = copy dest */
raw_dcache_line_size x15, x1/* x15 = dcache line size */
@@ -63,12 +61,12 @@ SYM_CODE_START(arm64_relocate_new_kernel)
isb
 
/* Start new image. */
-   mov x0, x18
+   ldr x4, [x0, #KIMAGE_START] /* relocation start */
+   ldr x0, [x0, #KIMAGE_ARCH_DTB_MEM]  /* dtb address */
mov x1, xzr
mov x2, xzr
mov x3, xzr
-   br  x17
-
+   br  x4
 SYM_CODE_END(arm64_relocate_new_kernel)
 
 .align 3   /* To keep the 64-bit values below naturally aligned. */
-- 
2.25.1



[PATCH v12 16/17] arm64: kexec: remove the pre-kexec PoC maintenance

2021-03-03 Thread Pavel Tatashin
Now that kexec does its relocations with the MMU enabled, we no longer
need to clean the relocation data to the PoC.

Co-developed-by: James Morse 
Signed-off-by: Pavel Tatashin 
---
 arch/arm64/kernel/machine_kexec.c | 40 ---
 1 file changed, 40 deletions(-)

diff --git a/arch/arm64/kernel/machine_kexec.c 
b/arch/arm64/kernel/machine_kexec.c
index d5c8aefc66f3..a1c9bee0cddd 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -76,45 +76,6 @@ int machine_kexec_prepare(struct kimage *kimage)
return 0;
 }
 
-/**
- * kexec_list_flush - Helper to flush the kimage list and source pages to PoC.
- */
-static void kexec_list_flush(struct kimage *kimage)
-{
-   kimage_entry_t *entry;
-
-   __flush_dcache_area(kimage, sizeof(*kimage));
-
-   for (entry = &kimage->head; ; entry++) {
-   unsigned int flag;
-   void *addr;
-
-   /* flush the list entries. */
-   __flush_dcache_area(entry, sizeof(kimage_entry_t));
-
-   flag = *entry & IND_FLAGS;
-   if (flag == IND_DONE)
-   break;
-
-   addr = phys_to_virt(*entry & PAGE_MASK);
-
-   switch (flag) {
-   case IND_INDIRECTION:
-   /* Set entry point just before the new list page. */
-   entry = (kimage_entry_t *)addr - 1;
-   break;
-   case IND_SOURCE:
-   /* flush the source pages. */
-   __flush_dcache_area(addr, PAGE_SIZE);
-   break;
-   case IND_DESTINATION:
-   break;
-   default:
-   BUG();
-   }
-   }
-}
-
 /**
  * kexec_segment_flush - Helper to flush the kimage segments to PoC.
  */
@@ -200,7 +161,6 @@ int machine_kexec_post_load(struct kimage *kimage)
__flush_dcache_area(reloc_code, reloc_size);
flush_icache_range((uintptr_t)reloc_code, (uintptr_t)reloc_code +
   reloc_size);
-   kexec_list_flush(kimage);
kexec_image_info(kimage);
 
return 0;
-- 
2.25.1



[PATCH v12 12/17] arm64: kexec: relocate in EL1 mode

2021-03-03 Thread Pavel Tatashin
Since we are going to keep the MMU enabled during relocation, we need to
stay in EL1 mode throughout the relocation.

Keep EL1 enabled, and switch to EL2 only just before entering the new world.

Suggested-by: James Morse 
Signed-off-by: Pavel Tatashin 
---
 arch/arm64/kernel/cpu-reset.h   |  3 +--
 arch/arm64/kernel/machine_kexec.c   |  4 ++--
 arch/arm64/kernel/relocate_kernel.S | 13 +++--
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/kernel/cpu-reset.h b/arch/arm64/kernel/cpu-reset.h
index 1922e7a690f8..f6d95512fec6 100644
--- a/arch/arm64/kernel/cpu-reset.h
+++ b/arch/arm64/kernel/cpu-reset.h
@@ -20,11 +20,10 @@ static inline void __noreturn cpu_soft_restart(unsigned 
long entry,
 {
typeof(__cpu_soft_restart) *restart;
 
-   unsigned long el2_switch = is_hyp_callable();
restart = (void *)__pa_symbol(__cpu_soft_restart);
 
cpu_install_idmap();
-   restart(el2_switch, entry, arg0, arg1, arg2);
+   restart(0, entry, arg0, arg1, arg2);
unreachable();
 }
 
diff --git a/arch/arm64/kernel/machine_kexec.c 
b/arch/arm64/kernel/machine_kexec.c
index fb03b6676fb9..d5940b7889f8 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -231,8 +231,8 @@ void machine_kexec(struct kimage *kimage)
} else {
if (is_hyp_callable())
__hyp_set_vectors(kimage->arch.el2_vectors);
-   cpu_soft_restart(kimage->arch.kern_reloc, virt_to_phys(kimage),
-0, 0);
+   cpu_soft_restart(kimage->arch.kern_reloc,
+virt_to_phys(kimage), 0, 0);
}
 
BUG(); /* Should never get here. */
diff --git a/arch/arm64/kernel/relocate_kernel.S 
b/arch/arm64/kernel/relocate_kernel.S
index 36b4496524c3..df023b82544b 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * arm64_relocate_new_kernel - Put a 2nd stage image in place and boot it.
@@ -61,12 +62,20 @@ SYM_CODE_START(arm64_relocate_new_kernel)
isb
 
/* Start new image. */
+   ldr x1, [x0, #KIMAGE_ARCH_EL2_VECTORS]  /* relocation start */
+   cbz x1, .Lel1
+   ldr x1, [x0, #KIMAGE_START] /* relocation start */
+   ldr x2, [x0, #KIMAGE_ARCH_DTB_MEM]  /* dtb address */
+   mov x3, xzr
+   mov x4, xzr
+   mov x0, #HVC_SOFT_RESTART
+   hvc #0  /* Jumps from el2 */
+.Lel1:
ldr x4, [x0, #KIMAGE_START] /* relocation start */
ldr x0, [x0, #KIMAGE_ARCH_DTB_MEM]  /* dtb address */
-   mov x1, xzr
mov x2, xzr
mov x3, xzr
-   br  x4
+   br  x4  /* Jumps from el1 */
 SYM_CODE_END(arm64_relocate_new_kernel)
 
 .align 3   /* To keep the 64-bit values below naturally aligned. */
-- 
2.25.1



[PATCH v12 09/17] arm64: kexec: Use dcache ops macros instead of open-coding

2021-03-03 Thread Pavel Tatashin
From: James Morse 

kexec does dcache maintenance when it re-writes all memory. Our
dcache_by_line_op macro depends on reading the sanitised DminLine
from memory. Kexec may have overwritten this, so it open-codes the
sequence.

dcache_by_line_op is a whole set of macros; it uses dcache_line_size,
which uses read_ctr for the sanitised DminLine. Reading the DminLine
is the first thing dcache_by_line_op does.

Rename dcache_by_line_op to dcache_by_myline_op and take DminLine as
an argument. Kexec can now use the slightly smaller macro.

This makes up-coming changes to the dcache maintenance easier on
the eye.

Code generated by the existing callers is unchanged.

Signed-off-by: James Morse 

[Fixed merging issues]

Signed-off-by: Pavel Tatashin 
---
 arch/arm64/include/asm/assembler.h  | 12 
 arch/arm64/kernel/relocate_kernel.S | 13 +++--
 2 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/include/asm/assembler.h 
b/arch/arm64/include/asm/assembler.h
index ca31594d3d6c..29061b76aab6 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -371,10 +371,9 @@ alternative_else
 alternative_endif
.endm
 
-   .macro dcache_by_line_op op, domain, kaddr, size, tmp1, tmp2
-   dcache_line_size \tmp1, \tmp2
+   .macro dcache_by_myline_op op, domain, kaddr, size, linesz, tmp2
add \size, \kaddr, \size
-   sub \tmp2, \tmp1, #1
+   sub \tmp2, \linesz, #1
bic \kaddr, \kaddr, \tmp2
 9998:
.ifc\op, cvau
@@ -394,12 +393,17 @@ alternative_endif
.endif
.endif
.endif
-   add \kaddr, \kaddr, \tmp1
+   add \kaddr, \kaddr, \linesz
cmp \kaddr, \size
b.lo9998b
dsb \domain
.endm
 
+   .macro dcache_by_line_op op, domain, kaddr, size, tmp1, tmp2
+   dcache_line_size \tmp1, \tmp2
+   dcache_by_myline_op \op, \domain, \kaddr, \size, \tmp1, \tmp2
+   .endm
+
 /*
  * Macro to perform an instruction cache maintenance for the interval
  * [start, end)
diff --git a/arch/arm64/kernel/relocate_kernel.S 
b/arch/arm64/kernel/relocate_kernel.S
index 8058fabe0a76..718037bef560 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -41,16 +41,9 @@ SYM_CODE_START(arm64_relocate_new_kernel)
tbz x16, IND_SOURCE_BIT, .Ltest_indirection
 
/* Invalidate dest page to PoC. */
-   mov x2, x13
-   add x20, x2, #PAGE_SIZE
-   sub x1, x15, #1
-   bic x2, x2, x1
-2: dc  ivac, x2
-   add x2, x2, x15
-   cmp x2, x20
-   b.lo2b
-   dsb sy
-
+   mov x2, x13
+   mov x1, #PAGE_SIZE
+   dcache_by_myline_op ivac, sy, x2, x1, x15, x20
copy_page x13, x12, x1, x2, x3, x4, x5, x6, x7, x8
b   .Lnext
 .Ltest_indirection:
-- 
2.25.1



[PATCH v12 11/17] arm64: kexec: kexec may require EL2 vectors

2021-03-03 Thread Pavel Tatashin
If we have an EL2 mode without VHE, the EL2 vectors are needed in order
to switch to EL2 and jump to the new world with hypervisor privileges.

In preparation for MMU-enabled relocation, configure our EL2 table now.

Suggested-by: James Morse 
Signed-off-by: Pavel Tatashin 
---
 arch/arm64/Kconfig|  2 +-
 arch/arm64/include/asm/kexec.h|  1 +
 arch/arm64/kernel/asm-offsets.c   |  1 +
 arch/arm64/kernel/machine_kexec.c | 31 +++
 4 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1f212b47a48a..825fe88b7c08 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1141,7 +1141,7 @@ config CRASH_DUMP
 
 config TRANS_TABLE
def_bool y
-   depends on HIBERNATION
+   depends on HIBERNATION || KEXEC_CORE
 
 config XEN_DOM0
def_bool y
diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 9befcd87e9a8..305cf0840ed3 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -96,6 +96,7 @@ struct kimage_arch {
void *dtb;
phys_addr_t dtb_mem;
phys_addr_t kern_reloc;
+   phys_addr_t el2_vectors;
/* Core ELF header buffer */
void *elf_headers;
unsigned long elf_headers_mem;
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 0c92e193f866..2e3278df1fc3 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -157,6 +157,7 @@ int main(void)
 #endif
 #ifdef CONFIG_KEXEC_CORE
   DEFINE(KIMAGE_ARCH_DTB_MEM,  offsetof(struct kimage, arch.dtb_mem));
+  DEFINE(KIMAGE_ARCH_EL2_VECTORS,  offsetof(struct kimage, 
arch.el2_vectors));
   DEFINE(KIMAGE_HEAD,  offsetof(struct kimage, head));
   DEFINE(KIMAGE_START, offsetof(struct kimage, start));
   BLANK();
diff --git a/arch/arm64/kernel/machine_kexec.c 
b/arch/arm64/kernel/machine_kexec.c
index 2e734e4ae12e..fb03b6676fb9 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "cpu-reset.h"
 
@@ -42,7 +43,9 @@ static void _kexec_image_info(const char *func, int line,
pr_debug("start:   %lx\n", kimage->start);
pr_debug("head:%lx\n", kimage->head);
pr_debug("nr_segments: %lu\n", kimage->nr_segments);
+   pr_debug("dtb_mem: %pa\n", >arch.dtb_mem);
pr_debug("kern_reloc: %pa\n", >arch.kern_reloc);
+   pr_debug("el2_vectors: %pa\n", >arch.el2_vectors);
 
for (i = 0; i < kimage->nr_segments; i++) {
pr_debug("  segment[%lu]: %016lx - %016lx, 0x%lx bytes, %lu 
pages\n",
@@ -137,9 +140,27 @@ static void kexec_segment_flush(const struct kimage 
*kimage)
}
 }
 
+/* Allocates pages for kexec page table */
+static void *kexec_page_alloc(void *arg)
+{
+   struct kimage *kimage = (struct kimage *)arg;
+   struct page *page = kimage_alloc_control_pages(kimage, 0);
+
+   if (!page)
+   return NULL;
+
+   memset(page_address(page), 0, PAGE_SIZE);
+
+   return page_address(page);
+}
+
 int machine_kexec_post_load(struct kimage *kimage)
 {
void *reloc_code = page_to_virt(kimage->control_code_page);
+   struct trans_pgd_info info = {
+   .trans_alloc_page   = kexec_page_alloc,
+   .trans_alloc_arg= kimage,
+   };
 
/* If in place, relocation is not used, only flush next kernel */
if (kimage->head & IND_DONE) {
@@ -148,6 +169,14 @@ int machine_kexec_post_load(struct kimage *kimage)
return 0;
}
 
+   kimage->arch.el2_vectors = 0;
+   if (is_hyp_callable()) {
+   int rc = trans_pgd_copy_el2_vectors(&info,
+   &kimage->arch.el2_vectors);
+   if (rc)
+   return rc;
+   }
+
memcpy(reloc_code, arm64_relocate_new_kernel,
   arm64_relocate_new_kernel_size);
kimage->arch.kern_reloc = __pa(reloc_code);
@@ -200,6 +229,8 @@ void machine_kexec(struct kimage *kimage)
restart(is_hyp_callable(), kimage->start, kimage->arch.dtb_mem,
0, 0);
} else {
+   if (is_hyp_callable())
+   __hyp_set_vectors(kimage->arch.el2_vectors);
cpu_soft_restart(kimage->arch.kern_reloc, virt_to_phys(kimage),
 0, 0);
}
-- 
2.25.1



[PATCH v12 17/17] arm64: kexec: Remove cpu-reset.h

2021-03-03 Thread Pavel Tatashin
This header contains only the cpu_soft_restart() wrapper, which is never
used directly anymore. So, remove this header, and rename the assembly
helper __cpu_soft_restart() to cpu_soft_restart().

Suggested-by: James Morse 
Signed-off-by: Pavel Tatashin 
---
 arch/arm64/include/asm/kexec.h|  6 ++
 arch/arm64/kernel/cpu-reset.S |  7 +++
 arch/arm64/kernel/cpu-reset.h | 30 --
 arch/arm64/kernel/machine_kexec.c |  6 ++
 4 files changed, 11 insertions(+), 38 deletions(-)
 delete mode 100644 arch/arm64/kernel/cpu-reset.h

diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 5fc87b51f8a9..ee71ae3b93ed 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -90,6 +90,12 @@ static inline void crash_prepare_suspend(void) {}
 static inline void crash_post_resume(void) {}
 #endif
 
+#if defined(CONFIG_KEXEC_CORE)
+void cpu_soft_restart(unsigned long el2_switch, unsigned long entry,
+ unsigned long arg0, unsigned long arg1,
+ unsigned long arg2);
+#endif
+
 #define ARCH_HAS_KIMAGE_ARCH
 
 struct kimage_arch {
diff --git a/arch/arm64/kernel/cpu-reset.S b/arch/arm64/kernel/cpu-reset.S
index 37721eb6f9a1..5d47d6c92634 100644
--- a/arch/arm64/kernel/cpu-reset.S
+++ b/arch/arm64/kernel/cpu-reset.S
@@ -16,8 +16,7 @@
 .pushsection.idmap.text, "awx"
 
 /*
- * __cpu_soft_restart(el2_switch, entry, arg0, arg1, arg2) - Helper for
- * cpu_soft_restart.
+ * cpu_soft_restart(el2_switch, entry, arg0, arg1, arg2)
  *
  * @el2_switch: Flag to indicate a switch to EL2 is needed.
  * @entry: Location to jump to for soft reset.
@@ -29,7 +28,7 @@
  * branch to what would be the reset vector. It must be executed with the
  * flat identity mapping.
  */
-SYM_CODE_START(__cpu_soft_restart)
+SYM_CODE_START(cpu_soft_restart)
/* Clear sctlr_el1 flags. */
mrs x12, sctlr_el1
mov_q   x13, SCTLR_ELx_FLAGS
@@ -51,6 +50,6 @@ SYM_CODE_START(__cpu_soft_restart)
mov x1, x3  // arg1
mov x2, x4  // arg2
br  x8
-SYM_CODE_END(__cpu_soft_restart)
+SYM_CODE_END(cpu_soft_restart)
 
 .popsection
diff --git a/arch/arm64/kernel/cpu-reset.h b/arch/arm64/kernel/cpu-reset.h
deleted file mode 100644
index f6d95512fec6..
--- a/arch/arm64/kernel/cpu-reset.h
+++ /dev/null
@@ -1,30 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * CPU reset routines
- *
- * Copyright (C) 2015 Huawei Futurewei Technologies.
- */
-
-#ifndef _ARM64_CPU_RESET_H
-#define _ARM64_CPU_RESET_H
-
-#include 
-
-void __cpu_soft_restart(unsigned long el2_switch, unsigned long entry,
-   unsigned long arg0, unsigned long arg1, unsigned long arg2);
-
-static inline void __noreturn cpu_soft_restart(unsigned long entry,
-  unsigned long arg0,
-  unsigned long arg1,
-  unsigned long arg2)
-{
-   typeof(__cpu_soft_restart) *restart;
-
-   restart = (void *)__pa_symbol(__cpu_soft_restart);
-
-   cpu_install_idmap();
-   restart(0, entry, arg0, arg1, arg2);
-   unreachable();
-}
-
-#endif
diff --git a/arch/arm64/kernel/machine_kexec.c 
b/arch/arm64/kernel/machine_kexec.c
index a1c9bee0cddd..ef7ba93f2bd6 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -23,8 +23,6 @@
 #include 
 #include 
 
-#include "cpu-reset.h"
-
 /**
  * kexec_image_info - For debugging output.
  */
@@ -197,10 +195,10 @@ void machine_kexec(struct kimage *kimage)
 * In kexec_file case, the kernel starts directly without purgatory.
 */
if (kimage->head & IND_DONE) {
-   typeof(__cpu_soft_restart) *restart;
+   typeof(cpu_soft_restart) *restart;
 
cpu_install_idmap();
-   restart = (void *)__pa_symbol(__cpu_soft_restart);
+   restart = (void *)__pa_symbol(cpu_soft_restart);
restart(is_hyp_callable(), kimage->start, kimage->arch.dtb_mem,
0, 0);
} else {
-- 
2.25.1



[PATCH v12 13/17] arm64: kexec: use ld script for relocation function

2021-03-03 Thread Pavel Tatashin
Currently, the relocation code declares start and end variables
which are used to compute its size.

A better way to do this is to use the ld script instead, and put the
relocation function in its own section.

Signed-off-by: Pavel Tatashin 
---
 arch/arm64/include/asm/sections.h   |  1 +
 arch/arm64/kernel/machine_kexec.c   | 14 ++
 arch/arm64/kernel/relocate_kernel.S | 15 ++-
 arch/arm64/kernel/vmlinux.lds.S | 19 +++
 4 files changed, 28 insertions(+), 21 deletions(-)

diff --git a/arch/arm64/include/asm/sections.h 
b/arch/arm64/include/asm/sections.h
index 2f36b16a5b5d..31e459af89f6 100644
--- a/arch/arm64/include/asm/sections.h
+++ b/arch/arm64/include/asm/sections.h
@@ -20,5 +20,6 @@ extern char __exittext_begin[], __exittext_end[];
 extern char __irqentry_text_start[], __irqentry_text_end[];
 extern char __mmuoff_data_start[], __mmuoff_data_end[];
 extern char __entry_tramp_text_start[], __entry_tramp_text_end[];
+extern char __relocate_new_kernel_start[], __relocate_new_kernel_end[];
 
 #endif /* __ASM_SECTIONS_H */
diff --git a/arch/arm64/kernel/machine_kexec.c 
b/arch/arm64/kernel/machine_kexec.c
index d5940b7889f8..f1451d807708 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -20,14 +20,11 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "cpu-reset.h"
 
-/* Global variables for the arm64_relocate_new_kernel routine. */
-extern const unsigned char arm64_relocate_new_kernel[];
-extern const unsigned long arm64_relocate_new_kernel_size;
-
 /**
  * kexec_image_info - For debugging output.
  */
@@ -157,6 +154,7 @@ static void *kexec_page_alloc(void *arg)
 int machine_kexec_post_load(struct kimage *kimage)
 {
void *reloc_code = page_to_virt(kimage->control_code_page);
+   long reloc_size;
struct trans_pgd_info info = {
.trans_alloc_page   = kexec_page_alloc,
.trans_alloc_arg= kimage,
@@ -177,14 +175,14 @@ int machine_kexec_post_load(struct kimage *kimage)
return rc;
}
 
-   memcpy(reloc_code, arm64_relocate_new_kernel,
-  arm64_relocate_new_kernel_size);
+   reloc_size = __relocate_new_kernel_end - __relocate_new_kernel_start;
+   memcpy(reloc_code, __relocate_new_kernel_start, reloc_size);
kimage->arch.kern_reloc = __pa(reloc_code);
 
/* Flush the reloc_code in preparation for its execution. */
-   __flush_dcache_area(reloc_code, arm64_relocate_new_kernel_size);
+   __flush_dcache_area(reloc_code, reloc_size);
flush_icache_range((uintptr_t)reloc_code, (uintptr_t)reloc_code +
-  arm64_relocate_new_kernel_size);
+  reloc_size);
kexec_list_flush(kimage);
kexec_image_info(kimage);
 
diff --git a/arch/arm64/kernel/relocate_kernel.S 
b/arch/arm64/kernel/relocate_kernel.S
index df023b82544b..7a600ba33ae1 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -15,6 +15,7 @@
 #include 
 #include 
 
+.pushsection".kexec_relocate.text", "ax"
 /*
  * arm64_relocate_new_kernel - Put a 2nd stage image in place and boot it.
  *
@@ -77,16 +78,4 @@ SYM_CODE_START(arm64_relocate_new_kernel)
mov x3, xzr
br  x4  /* Jumps from el1 */
 SYM_CODE_END(arm64_relocate_new_kernel)
-
-.align 3   /* To keep the 64-bit values below naturally aligned. */
-
-.Lcopy_end:
-.org   KEXEC_CONTROL_PAGE_SIZE
-
-/*
- * arm64_relocate_new_kernel_size - Number of bytes to copy to the
- * control_code_page.
- */
-.globl arm64_relocate_new_kernel_size
-arm64_relocate_new_kernel_size:
-   .quad   .Lcopy_end - arm64_relocate_new_kernel
+.popsection
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 7eea7888bb02..0d9d5e6af66f 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -92,6 +93,16 @@ jiffies = jiffies_64;
 #define HIBERNATE_TEXT
 #endif
 
+#ifdef CONFIG_KEXEC_CORE
+#define KEXEC_TEXT \
+   . = ALIGN(SZ_4K);   \
+   __relocate_new_kernel_start = .;\
+   *(.kexec_relocate.text) \
+   __relocate_new_kernel_end = .;
+#else
+#define KEXEC_TEXT
+#endif
+
 #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
 #define TRAMP_TEXT \
. = ALIGN(PAGE_SIZE);   \
@@ -152,6 +163,7 @@ SECTIONS
HYPERVISOR_TEXT
IDMAP_TEXT
HIBERNATE_TEXT
+   KEXEC_TEXT
TRAMP_TEXT
*(.fixup)
*(.gnu.warning)
@@ -336,3 +348,10 @@ ASS

[PATCH v12 08/17] arm64: kexec: skip relocation code for inplace kexec

2021-03-03 Thread Pavel Tatashin
In the case of kdump, or when segments are already in place, relocation
is not needed; therefore the setup of the relocation function and the
call to it can be skipped.

Signed-off-by: Pavel Tatashin 
Suggested-by: James Morse 
---
 arch/arm64/kernel/machine_kexec.c   | 34 ++---
 arch/arm64/kernel/relocate_kernel.S |  3 ---
 2 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/kernel/machine_kexec.c 
b/arch/arm64/kernel/machine_kexec.c
index 3a034bc25709..b150b65f0b84 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -139,21 +139,23 @@ int machine_kexec_post_load(struct kimage *kimage)
 {
void *reloc_code = page_to_virt(kimage->control_code_page);
 
-   /* If in place flush new kernel image, else flush lists and buffers */
-   if (kimage->head & IND_DONE)
+   /* If in place, relocation is not used, only flush next kernel */
+   if (kimage->head & IND_DONE) {
kexec_segment_flush(kimage);
-   else
-   kexec_list_flush(kimage);
+   kexec_image_info(kimage);
+   return 0;
+   }
 
memcpy(reloc_code, arm64_relocate_new_kernel,
   arm64_relocate_new_kernel_size);
kimage->arch.kern_reloc = __pa(reloc_code);
-   kexec_image_info(kimage);
 
/* Flush the reloc_code in preparation for its execution. */
__flush_dcache_area(reloc_code, arm64_relocate_new_kernel_size);
flush_icache_range((uintptr_t)reloc_code, (uintptr_t)reloc_code +
   arm64_relocate_new_kernel_size);
+   kexec_list_flush(kimage);
+   kexec_image_info(kimage);
 
return 0;
 }
@@ -180,19 +182,25 @@ void machine_kexec(struct kimage *kimage)
local_daif_mask();
 
/*
-* cpu_soft_restart will shutdown the MMU, disable data caches, then
-* transfer control to the kern_reloc which contains a copy of
-* the arm64_relocate_new_kernel routine.  arm64_relocate_new_kernel
-* uses physical addressing to relocate the new image to its final
-* position and transfers control to the image entry point when the
-* relocation is complete.
+* Both restart and cpu_soft_restart will shutdown the MMU, disable data
+* caches. However, restart will start new kernel or purgatory directly,
+* cpu_soft_restart will transfer control to arm64_relocate_new_kernel
 * In kexec case, kimage->start points to purgatory assuming that
 * kernel entry and dtb address are embedded in purgatory by
 * userspace (kexec-tools).
 * In kexec_file case, the kernel starts directly without purgatory.
 */
-   cpu_soft_restart(kimage->arch.kern_reloc, kimage->head, kimage->start,
-kimage->arch.dtb_mem);
+   if (kimage->head & IND_DONE) {
+   typeof(__cpu_soft_restart) *restart;
+
+   cpu_install_idmap();
+   restart = (void *)__pa_symbol(__cpu_soft_restart);
+   restart(is_hyp_callable(), kimage->start, kimage->arch.dtb_mem,
+   0, 0);
+   } else {
+   cpu_soft_restart(kimage->arch.kern_reloc, kimage->head,
+kimage->start, kimage->arch.dtb_mem);
+   }
 
BUG(); /* Should never get here. */
 }
diff --git a/arch/arm64/kernel/relocate_kernel.S 
b/arch/arm64/kernel/relocate_kernel.S
index b78ea5de97a4..8058fabe0a76 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -32,8 +32,6 @@ SYM_CODE_START(arm64_relocate_new_kernel)
mov x16, x0 /* x16 = kimage_head */
mov x14, xzr/* x14 = entry ptr */
mov x13, xzr/* x13 = copy dest */
-   /* Check if the new image needs relocation. */
-   tbnzx16, IND_DONE_BIT, .Ldone
raw_dcache_line_size x15, x1/* x15 = dcache line size */
 .Lloop:
and x12, x16, PAGE_MASK /* x12 = addr */
@@ -65,7 +63,6 @@ SYM_CODE_START(arm64_relocate_new_kernel)
 .Lnext:
ldr x16, [x14], #8  /* entry = *ptr++ */
tbz x16, IND_DONE_BIT, .Lloop   /* while (!(entry & DONE)) */
-.Ldone:
/* wait for writes from copy_page to finish */
dsb nsh
ic  iallu
-- 
2.25.1



[PATCH v12 06/17] arm64: hibernate: abstract ttbr0 setup function

2021-03-03 Thread Pavel Tatashin
Currently, only hibernate sets a custom ttbr0 with a safe idmapped
function. Kexec is also going to be using this functionality when the
relocation code is idmapped.

Move the setup sequence to a dedicated cpu_install_ttbr0() for custom
ttbr0.

Suggested-by: James Morse 
Signed-off-by: Pavel Tatashin 
---
 arch/arm64/include/asm/mmu_context.h | 24 
 arch/arm64/kernel/hibernate.c| 21 +
 2 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/include/asm/mmu_context.h 
b/arch/arm64/include/asm/mmu_context.h
index 70ce8c1d2b07..c6521c8c06ac 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -132,6 +132,30 @@ static inline void cpu_install_idmap(void)
cpu_switch_mm(lm_alias(idmap_pg_dir), &init_mm);
 }
 
+/*
+ * Load our new page tables. A strict BBM approach requires that we ensure that
+ * TLBs are free of any entries that may overlap with the global mappings we 
are
+ * about to install.
+ *
+ * For a real hibernate/resume/kexec cycle TTBR0 currently points to a zero
+ * page, but TLBs may contain stale ASID-tagged entries (e.g. for EFI runtime
+ * services), while for a userspace-driven test_resume cycle it points to
+ * userspace page tables (and we must point it at a zero page ourselves).
+ *
+ * We change T0SZ as part of installing the idmap. This is undone by
+ * cpu_uninstall_idmap() in __cpu_suspend_exit().
+ */
+static inline void cpu_install_ttbr0(phys_addr_t ttbr0, unsigned long t0sz)
+{
+   cpu_set_reserved_ttbr0();
+   local_flush_tlb_all();
+   __cpu_set_tcr_t0sz(t0sz);
+
+   /* avoid cpu_switch_mm() and its SW-PAN and CNP interactions */
+   write_sysreg(ttbr0, ttbr0_el1);
+   isb();
+}
+
 /*
  * Atomically replaces the active TTBR1_EL1 PGD with a new VA-compatible PGD,
  * avoiding the possibility of conflicting TLB entries being allocated.
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 0b8bad8bb6eb..ded5115bcb63 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -206,26 +206,7 @@ static int create_safe_exec_page(void *src_start, size_t 
length,
if (rc)
return rc;
 
-   /*
-* Load our new page tables. A strict BBM approach requires that we
-* ensure that TLBs are free of any entries that may overlap with the
-* global mappings we are about to install.
-*
-* For a real hibernate/resume cycle TTBR0 currently points to a zero
-* page, but TLBs may contain stale ASID-tagged entries (e.g. for EFI
-* runtime services), while for a userspace-driven test_resume cycle it
-* points to userspace page tables (and we must point it at a zero page
-* ourselves).
-*
-* We change T0SZ as part of installing the idmap. This is undone by
-* cpu_uninstall_idmap() in __cpu_suspend_exit().
-*/
-   cpu_set_reserved_ttbr0();
-   local_flush_tlb_all();
-   __cpu_set_tcr_t0sz(t0sz);
-   write_sysreg(trans_ttbr0, ttbr0_el1);
-   isb();
-
+   cpu_install_ttbr0(trans_ttbr0, t0sz);
*phys_dst_addr = virt_to_phys(page);
 
return 0;
-- 
2.25.1



[PATCH v12 07/17] arm64: kexec: flush image and lists during kexec load time

2021-03-03 Thread Pavel Tatashin
Currently, during kexec load we copy the relocation function and
flush it. However, we can also flush the kexec relocation buffers and,
if the new kernel image is already in place (i.e. crash kernel), we can
also flush the new kernel image itself.

Signed-off-by: Pavel Tatashin 
---
 arch/arm64/kernel/machine_kexec.c | 49 +++
 1 file changed, 23 insertions(+), 26 deletions(-)

diff --git a/arch/arm64/kernel/machine_kexec.c 
b/arch/arm64/kernel/machine_kexec.c
index 90a335c74442..3a034bc25709 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -59,23 +59,6 @@ void machine_kexec_cleanup(struct kimage *kimage)
/* Empty routine needed to avoid build errors. */
 }
 
-int machine_kexec_post_load(struct kimage *kimage)
-{
-   void *reloc_code = page_to_virt(kimage->control_code_page);
-
-   memcpy(reloc_code, arm64_relocate_new_kernel,
-  arm64_relocate_new_kernel_size);
-   kimage->arch.kern_reloc = __pa(reloc_code);
-   kexec_image_info(kimage);
-
-   /* Flush the reloc_code in preparation for its execution. */
-   __flush_dcache_area(reloc_code, arm64_relocate_new_kernel_size);
-   flush_icache_range((uintptr_t)reloc_code, (uintptr_t)reloc_code +
-  arm64_relocate_new_kernel_size);
-
-   return 0;
-}
-
 /**
  * machine_kexec_prepare - Prepare for a kexec reboot.
  *
@@ -152,6 +135,29 @@ static void kexec_segment_flush(const struct kimage 
*kimage)
}
 }
 
+int machine_kexec_post_load(struct kimage *kimage)
+{
+   void *reloc_code = page_to_virt(kimage->control_code_page);
+
+   /* If in place flush new kernel image, else flush lists and buffers */
+   if (kimage->head & IND_DONE)
+   kexec_segment_flush(kimage);
+   else
+   kexec_list_flush(kimage);
+
+   memcpy(reloc_code, arm64_relocate_new_kernel,
+  arm64_relocate_new_kernel_size);
+   kimage->arch.kern_reloc = __pa(reloc_code);
+   kexec_image_info(kimage);
+
+   /* Flush the reloc_code in preparation for its execution. */
+   __flush_dcache_area(reloc_code, arm64_relocate_new_kernel_size);
+   flush_icache_range((uintptr_t)reloc_code, (uintptr_t)reloc_code +
+  arm64_relocate_new_kernel_size);
+
+   return 0;
+}
+
 /**
  * machine_kexec - Do the kexec reboot.
  *
@@ -169,13 +175,6 @@ void machine_kexec(struct kimage *kimage)
WARN(in_kexec_crash && (stuck_cpus || smp_crash_stop_failed()),
"Some CPUs may be stale, kdump will be unreliable.\n");
 
-   /* Flush the kimage list and its buffers. */
-   kexec_list_flush(kimage);
-
-   /* Flush the new image if already in place. */
-   if ((kimage != kexec_crash_image) && (kimage->head & IND_DONE))
-   kexec_segment_flush(kimage);
-
pr_info("Bye!\n");
 
local_daif_mask();
@@ -250,8 +249,6 @@ void arch_kexec_protect_crashkres(void)
 {
int i;
 
-   kexec_segment_flush(kexec_crash_image);
-
for (i = 0; i < kexec_crash_image->nr_segments; i++)
set_memory_valid(
__phys_to_virt(kexec_crash_image->segment[i].mem),
-- 
2.25.1



[PATCH v12 05/17] arm64: trans_pgd: hibernate: Add trans_pgd_copy_el2_vectors

2021-03-03 Thread Pavel Tatashin
Users of trans_pgd may also need a copy of the vector table, because it
may also be overwritten if the linear map can be overwritten.

Move the setup of the EL2 vectors from hibernate to trans_pgd, so it can
later be shared with kexec as well.

Suggested-by: James Morse 
Signed-off-by: Pavel Tatashin 
---
 arch/arm64/include/asm/trans_pgd.h |  3 +++
 arch/arm64/include/asm/virt.h  |  3 +++
 arch/arm64/kernel/hibernate.c  | 28 ++--
 arch/arm64/mm/trans_pgd.c  | 20 
 4 files changed, 36 insertions(+), 18 deletions(-)

diff --git a/arch/arm64/include/asm/trans_pgd.h 
b/arch/arm64/include/asm/trans_pgd.h
index 5d08e5adf3d5..e0760e52d36d 100644
--- a/arch/arm64/include/asm/trans_pgd.h
+++ b/arch/arm64/include/asm/trans_pgd.h
@@ -36,4 +36,7 @@ int trans_pgd_map_page(struct trans_pgd_info *info, pgd_t 
*trans_pgd,
 int trans_pgd_idmap_page(struct trans_pgd_info *info, phys_addr_t *trans_ttbr0,
 unsigned long *t0sz, void *page);
 
+int trans_pgd_copy_el2_vectors(struct trans_pgd_info *info,
+  phys_addr_t *el2_vectors);
+
 #endif /* _ASM_TRANS_TABLE_H */
diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h
index 4216c8623538..bfbb66018114 100644
--- a/arch/arm64/include/asm/virt.h
+++ b/arch/arm64/include/asm/virt.h
@@ -67,6 +67,9 @@
  */
 extern u32 __boot_cpu_mode[2];
 
+extern char __hyp_stub_vectors[];
+#define ARM64_VECTOR_TABLE_LEN SZ_2K
+
 void __hyp_set_vectors(phys_addr_t phys_vector_base);
 void __hyp_reset_vectors(void);
 
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index c764574a1acb..0b8bad8bb6eb 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -48,12 +48,6 @@
  */
 extern int in_suspend;
 
-/* temporary el2 vectors in the __hibernate_exit_text section. */
-extern char hibernate_el2_vectors[];
-
-/* hyp-stub vectors, used to restore el2 during resume from hibernate. */
-extern char __hyp_stub_vectors[];
-
 /*
  * The logical cpu number we should resume on, initialised to a non-cpu
  * number.
@@ -428,6 +422,7 @@ int swsusp_arch_resume(void)
void *zero_page;
size_t exit_size;
pgd_t *tmp_pg_dir;
+   phys_addr_t el2_vectors;
void __noreturn (*hibernate_exit)(phys_addr_t, phys_addr_t, void *,
  void *, phys_addr_t, phys_addr_t);
struct trans_pgd_info trans_info = {
@@ -455,6 +450,14 @@ int swsusp_arch_resume(void)
return -ENOMEM;
}
 
+   if (is_hyp_callable()) {
+   rc = trans_pgd_copy_el2_vectors(&trans_info, &el2_vectors);
+   if (rc) {
+   pr_err("Failed to setup el2 vectors\n");
+   return rc;
+   }
+   }
+
exit_size = __hibernate_exit_text_end - __hibernate_exit_text_start;
/*
 * Copy swsusp_arch_suspend_exit() to a safe page. This will generate
@@ -467,25 +470,14 @@ int swsusp_arch_resume(void)
return rc;
}
 
-   /*
-* The hibernate exit text contains a set of el2 vectors, that will
-* be executed at el2 with the mmu off in order to reload hyp-stub.
-*/
-   __flush_dcache_area(hibernate_exit, exit_size);
-
/*
 * KASLR will cause the el2 vectors to be in a different location in
 * the resumed kernel. Load hibernate's temporary copy into el2.
 *
 * We can skip this step if we booted at EL1, or are running with VHE.
 */
-   if (is_hyp_callable()) {
-   phys_addr_t el2_vectors = (phys_addr_t)hibernate_exit;
-   el2_vectors += hibernate_el2_vectors -
-  __hibernate_exit_text_start; /* offset */
-
+   if (is_hyp_callable())
__hyp_set_vectors(el2_vectors);
-   }
 
hibernate_exit(virt_to_phys(tmp_pg_dir), resume_hdr.ttbr1_el1,
   resume_hdr.reenter_kernel, restore_pblist,
diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 527f0a39c3da..61549451ed3a 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -322,3 +322,23 @@ int trans_pgd_idmap_page(struct trans_pgd_info *info, 
phys_addr_t *trans_ttbr0,
 
return 0;
 }
+
+/*
+ * Create a copy of the vector table so we can call HVC_SET_VECTORS or
+ * HVC_SOFT_RESTART from contexts where the table may be overwritten.
+ */
+int trans_pgd_copy_el2_vectors(struct trans_pgd_info *info,
+  phys_addr_t *el2_vectors)
+{
+   void *hyp_stub = trans_alloc(info);
+
+   if (!hyp_stub)
+   return -ENOMEM;
+   *el2_vectors = virt_to_phys(hyp_stub);
+   memcpy(hyp_stub, &__hyp_stub_vectors, ARM64_VECTOR_TABLE_LEN);
+   __flush_icache_range((unsigned long)hyp_stub,
+(unsigned long)hyp_stub + ARM64_VECTOR_TABLE_LEN);
+   __flu

[PATCH v12 03/17] arm64: hyp-stub: Move el1_sync into the vectors

2021-03-03 Thread Pavel Tatashin
From: James Morse 

The hyp-stub's el1_sync code doesn't do very much, this can easily fit
in the vectors.

With this, all of the hyp-stubs behaviour is contained in its vectors.
This lets kexec and hibernate copy the hyp-stub when they need its
behaviour, instead of re-implementing it.

Signed-off-by: James Morse 

[Fixed merging issues]

Signed-off-by: Pavel Tatashin 
---
 arch/arm64/kernel/hyp-stub.S | 59 ++--
 1 file changed, 29 insertions(+), 30 deletions(-)

diff --git a/arch/arm64/kernel/hyp-stub.S b/arch/arm64/kernel/hyp-stub.S
index ff329c5c074d..d1a73d0f74e0 100644
--- a/arch/arm64/kernel/hyp-stub.S
+++ b/arch/arm64/kernel/hyp-stub.S
@@ -21,6 +21,34 @@ SYM_CODE_START_LOCAL(\label)
.align 7
b   \label
 SYM_CODE_END(\label)
+.endm
+
+.macro hyp_stub_el1_sync
+SYM_CODE_START_LOCAL(hyp_stub_el1_sync)
+   .align 7
+   cmp x0, #HVC_SET_VECTORS
+   b.ne2f
+   msr vbar_el2, x1
+   b   9f
+
+2: cmp x0, #HVC_SOFT_RESTART
+   b.ne3f
+   mov x0, x2
+   mov x2, x4
+   mov x4, x1
+   mov x1, x3
+   br  x4  // no return
+
+3: cmp x0, #HVC_RESET_VECTORS
+   beq 9f  // Nothing to reset!
+
+   /* Someone called kvm_call_hyp() against the hyp-stub... */
+   mov_q   x0, HVC_STUB_ERR
+   eret
+
+9: mov x0, xzr
+   eret
+SYM_CODE_END(hyp_stub_el1_sync)
 .endm
 
.text
@@ -39,7 +67,7 @@ SYM_CODE_START(__hyp_stub_vectors)
invalid_vector  hyp_stub_el2h_fiq_invalid   // FIQ EL2h
invalid_vector  hyp_stub_el2h_error_invalid // Error EL2h
 
-   ventry  el1_sync// Synchronous 64-bit EL1
+   hyp_stub_el1_sync   // Synchronous 64-bit 
EL1
invalid_vector  hyp_stub_el1_irq_invalid// IRQ 64-bit EL1
invalid_vector  hyp_stub_el1_fiq_invalid// FIQ 64-bit EL1
invalid_vector  hyp_stub_el1_error_invalid  // Error 64-bit EL1
@@ -55,35 +83,6 @@ SYM_CODE_END(__hyp_stub_vectors)
 # Check the __hyp_stub_vectors didn't overflow
 .org . - (__hyp_stub_vectors_end - __hyp_stub_vectors) + SZ_2K
 
-
-SYM_CODE_START_LOCAL(el1_sync)
-   cmp x0, #HVC_SET_VECTORS
-   b.ne1f
-   msr vbar_el2, x1
-   b   9f
-
-1: cmp x0, #HVC_VHE_RESTART
-   b.eqmutate_to_vhe
-
-2: cmp x0, #HVC_SOFT_RESTART
-   b.ne3f
-   mov x0, x2
-   mov x2, x4
-   mov x4, x1
-   mov x1, x3
-   br  x4  // no return
-
-3: cmp x0, #HVC_RESET_VECTORS
-   beq 9f  // Nothing to reset!
-
-   /* Someone called kvm_call_hyp() against the hyp-stub... */
-   mov_q   x0, HVC_STUB_ERR
-   eret
-
-9: mov x0, xzr
-   eret
-SYM_CODE_END(el1_sync)
-
 // nVHE? No way! Give me the real thing!
 SYM_CODE_START_LOCAL(mutate_to_vhe)
// Sanity check: MMU *must* be off
-- 
2.25.1



[PATCH v12 04/17] arm64: kernel: add helper for booted at EL2 and not VHE

2021-03-03 Thread Pavel Tatashin
Replace places that contain logic like this:
is_hyp_mode_available() && !is_kernel_in_hyp_mode()

with a dedicated boolean function, is_hyp_callable(). This will be needed
later in kexec in order to switch back to EL2 sooner.

Suggested-by: James Morse 
Signed-off-by: Pavel Tatashin 
---
 arch/arm64/include/asm/virt.h | 5 +
 arch/arm64/kernel/cpu-reset.h | 3 +--
 arch/arm64/kernel/hibernate.c | 9 +++--
 arch/arm64/kernel/sdei.c  | 2 +-
 4 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h
index 7379f35ae2c6..4216c8623538 100644
--- a/arch/arm64/include/asm/virt.h
+++ b/arch/arm64/include/asm/virt.h
@@ -128,6 +128,11 @@ static __always_inline bool is_protected_kvm_enabled(void)
return cpus_have_final_cap(ARM64_KVM_PROTECTED_MODE);
 }
 
+static inline bool is_hyp_callable(void)
+{
+   return is_hyp_mode_available() && !is_kernel_in_hyp_mode();
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* ! __ASM__VIRT_H */
diff --git a/arch/arm64/kernel/cpu-reset.h b/arch/arm64/kernel/cpu-reset.h
index ed50e9587ad8..1922e7a690f8 100644
--- a/arch/arm64/kernel/cpu-reset.h
+++ b/arch/arm64/kernel/cpu-reset.h
@@ -20,8 +20,7 @@ static inline void __noreturn cpu_soft_restart(unsigned long 
entry,
 {
typeof(__cpu_soft_restart) *restart;
 
-   unsigned long el2_switch = !is_kernel_in_hyp_mode() &&
-   is_hyp_mode_available();
+   unsigned long el2_switch = is_hyp_callable();
restart = (void *)__pa_symbol(__cpu_soft_restart);
 
cpu_install_idmap();
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index b1cef371df2b..c764574a1acb 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -48,9 +48,6 @@
  */
 extern int in_suspend;
 
-/* Do we need to reset el2? */
-#define el2_reset_needed() (is_hyp_mode_available() && 
!is_kernel_in_hyp_mode())
-
 /* temporary el2 vectors in the __hibernate_exit_text section. */
 extern char hibernate_el2_vectors[];
 
@@ -125,7 +122,7 @@ int arch_hibernation_header_save(void *addr, unsigned int 
max_size)
hdr->reenter_kernel = _cpu_resume;
 
/* We can't use __hyp_get_vectors() because kvm may still be loaded */
-   if (el2_reset_needed())
+   if (is_hyp_callable())
hdr->__hyp_stub_vectors = __pa_symbol(__hyp_stub_vectors);
else
hdr->__hyp_stub_vectors = 0;
@@ -387,7 +384,7 @@ int swsusp_arch_suspend(void)
dcache_clean_range(__idmap_text_start, __idmap_text_end);
 
/* Clean kvm setup code to PoC? */
-   if (el2_reset_needed()) {
+   if (is_hyp_callable()) {
dcache_clean_range(__hyp_idmap_text_start, 
__hyp_idmap_text_end);
dcache_clean_range(__hyp_text_start, __hyp_text_end);
}
@@ -482,7 +479,7 @@ int swsusp_arch_resume(void)
 *
 * We can skip this step if we booted at EL1, or are running with VHE.
 */
-   if (el2_reset_needed()) {
+   if (is_hyp_callable()) {
phys_addr_t el2_vectors = (phys_addr_t)hibernate_exit;
el2_vectors += hibernate_el2_vectors -
   __hibernate_exit_text_start; /* offset */
diff --git a/arch/arm64/kernel/sdei.c b/arch/arm64/kernel/sdei.c
index 2c7ca449dd51..af0ac2f920cf 100644
--- a/arch/arm64/kernel/sdei.c
+++ b/arch/arm64/kernel/sdei.c
@@ -200,7 +200,7 @@ unsigned long sdei_arch_get_entry_point(int conduit)
 * dropped to EL1 because we don't support VHE, then we can't support
 * SDEI.
 */
-   if (is_hyp_mode_available() && !is_kernel_in_hyp_mode()) {
+   if (is_hyp_callable()) {
pr_err("Not supported on this hardware/boot configuration\n");
goto out_err;
}
-- 
2.25.1



[PATCH v12 02/17] arm64: hyp-stub: Move invalid vector entries into the vectors

2021-03-03 Thread Pavel Tatashin
From: James Morse 

Most of the hyp-stub's vector entries are invalid. These are each
a unique function that branches to itself. To move these into the
vectors, merge the ventry and invalid_vector macros and give each
one a unique name.

This means we can copy the hyp-stub, as it is self-contained within
its vectors.

Signed-off-by: James Morse 

[Fixed merging issues]

Signed-off-by: Pavel Tatashin 
---
 arch/arm64/kernel/hyp-stub.S | 56 +++-
 1 file changed, 23 insertions(+), 33 deletions(-)

diff --git a/arch/arm64/kernel/hyp-stub.S b/arch/arm64/kernel/hyp-stub.S
index 572b28646005..ff329c5c074d 100644
--- a/arch/arm64/kernel/hyp-stub.S
+++ b/arch/arm64/kernel/hyp-stub.S
@@ -16,31 +16,38 @@
 #include 
 #include 
 
+.macro invalid_vector  label
+SYM_CODE_START_LOCAL(\label)
+   .align 7
+   b   \label
+SYM_CODE_END(\label)
+.endm
+
.text
.pushsection.hyp.text, "ax"
 
.align 11
 
 SYM_CODE_START(__hyp_stub_vectors)
-   ventry  el2_sync_invalid// Synchronous EL2t
-   ventry  el2_irq_invalid // IRQ EL2t
-   ventry  el2_fiq_invalid // FIQ EL2t
-   ventry  el2_error_invalid   // Error EL2t
+   invalid_vector  hyp_stub_el2t_sync_invalid  // Synchronous EL2t
+   invalid_vector  hyp_stub_el2t_irq_invalid   // IRQ EL2t
+   invalid_vector  hyp_stub_el2t_fiq_invalid   // FIQ EL2t
+   invalid_vector  hyp_stub_el2t_error_invalid // Error EL2t
 
-   ventry  el2_sync_invalid// Synchronous EL2h
-   ventry  el2_irq_invalid // IRQ EL2h
-   ventry  el2_fiq_invalid // FIQ EL2h
-   ventry  el2_error_invalid   // Error EL2h
+   invalid_vector  hyp_stub_el2h_sync_invalid  // Synchronous EL2h
+   invalid_vector  hyp_stub_el2h_irq_invalid   // IRQ EL2h
+   invalid_vector  hyp_stub_el2h_fiq_invalid   // FIQ EL2h
+   invalid_vector  hyp_stub_el2h_error_invalid // Error EL2h
 
ventry  el1_sync// Synchronous 64-bit EL1
-   ventry  el1_irq_invalid // IRQ 64-bit EL1
-   ventry  el1_fiq_invalid // FIQ 64-bit EL1
-   ventry  el1_error_invalid   // Error 64-bit EL1
-
-   ventry  el1_sync_invalid// Synchronous 32-bit EL1
-   ventry  el1_irq_invalid // IRQ 32-bit EL1
-   ventry  el1_fiq_invalid // FIQ 32-bit EL1
-   ventry  el1_error_invalid   // Error 32-bit EL1
+   invalid_vector  hyp_stub_el1_irq_invalid// IRQ 64-bit EL1
+   invalid_vector  hyp_stub_el1_fiq_invalid// FIQ 64-bit EL1
+   invalid_vector  hyp_stub_el1_error_invalid  // Error 64-bit EL1
+
+   invalid_vector  hyp_stub_32b_el1_sync_invalid   // Synchronous 32-bit 
EL1
+   invalid_vector  hyp_stub_32b_el1_irq_invalid// IRQ 32-bit EL1
+   invalid_vector  hyp_stub_32b_el1_fiq_invalid// FIQ 32-bit EL1
+   invalid_vector  hyp_stub_32b_el1_error_invalid  // Error 32-bit EL1
.align 11
 SYM_INNER_LABEL(__hyp_stub_vectors_end, SYM_L_LOCAL)
 SYM_CODE_END(__hyp_stub_vectors)
@@ -173,23 +180,6 @@ SYM_CODE_END(enter_vhe)
 
.popsection
 
-.macro invalid_vector  label
-SYM_CODE_START_LOCAL(\label)
-   b \label
-SYM_CODE_END(\label)
-.endm
-
-   invalid_vector  el2_sync_invalid
-   invalid_vector  el2_irq_invalid
-   invalid_vector  el2_fiq_invalid
-   invalid_vector  el2_error_invalid
-   invalid_vector  el1_sync_invalid
-   invalid_vector  el1_irq_invalid
-   invalid_vector  el1_fiq_invalid
-   invalid_vector  el1_error_invalid
-
-   .popsection
-
 /*
  * __hyp_set_vectors: Call this after boot to set the initial hypervisor
  * vectors as part of hypervisor installation.  On an SMP system, this should
-- 
2.25.1



[PATCH v12 00/17] arm64: MMU enabled kexec relocation

2021-03-03 Thread Pavel Tatashin
 Removed: "add trans_pgd_create_empty"
- replace init_mm with NULL, and keep using non "__" version of
  populate functions.
v3:
- Split changes to create_safe_exec_page() into several patches for
  easier review as request by Mark Rutland. This is why this series
  has 3 more patches.
- Renamed trans_table to trans_pgd as agreed with Mark. The header
  comment in trans_pgd.c explains that trans stands for
  transitional page tables. Meaning they are used in transition
  between two kernels.
v2:
- Fixed hibernate bug reported by James Morse
- Addressed comments from James Morse:
  * More incremental changes to trans_table
  * Removed TRANS_FORCEMAP
  * Added kexec reboot data for image with 380M in size.

Enable MMU during kexec relocation in order to improve reboot performance.

If kexec functionality is used for a fast system update with minimal
downtime, the relocation of kernel + initramfs takes a significant portion
of the reboot time.

The reason relocation is slow is that it is done with the MMU disabled, and
thus does not benefit from the D-cache.

Performance data

For this experiment, the size of the kernel plus initramfs is small, only 25M.
If the initramfs were larger, the improvement would be greater, as the time
spent in relocation is proportional to the amount of data relocated.

Previously:
kernel shutdown 0.022131328s
relocation  0.440510736s
kernel startup  0.294706768s

Relocation was taking: 58.2% of reboot time

Now:
kernel shutdown 0.032066576s
relocation  0.022158152s
kernel startup  0.296055880s

Now: Relocation takes 6.3% of reboot time

Total reboot is x2.16 times faster.
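
For reference, the quoted percentages and the x2.16 figure follow directly
from the timings above. A trivial standalone check (plain C, numbers copied
from the measurements; not part of the series):

#include <stdio.h>

int main(void)
{
	double old_total = 0.022131328 + 0.440510736 + 0.294706768;
	double new_total = 0.032066576 + 0.022158152 + 0.296055880;

	/* share of total reboot time spent in relocation, before and after */
	printf("old relocation share: %.1f%%\n", 100.0 * 0.440510736 / old_total);
	printf("new relocation share: %.1f%%\n", 100.0 * 0.022158152 / new_total);
	/* overall reboot speedup */
	printf("reboot speedup:       x%.2f\n", old_total / new_total);
	return 0;
}

This prints roughly 58.2%, 6.3% and x2.16, matching the numbers quoted above.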

With bigger userland (fitImage 380M), the reboot time is improved by 3.57s,
and is reduced from 3.9s down to 0.33s

Previous approaches and discussions
---
v11: 
https://lore.kernel.org/lkml/20210127172706.617195-1-pasha.tatas...@soleen.com
v10: 
https://lore.kernel.org/linux-arm-kernel/20210125191923.1060122-1-pasha.tatas...@soleen.com
v9: 
https://lore.kernel.org/lkml/20200326032420.27220-1-pasha.tatas...@soleen.com
v8: 
https://lore.kernel.org/lkml/20191204155938.2279686-1-pasha.tatas...@soleen.com
v7: 
https://lore.kernel.org/lkml/20191016200034.1342308-1-pasha.tatas...@soleen.com
v6: 
https://lore.kernel.org/lkml/20191004185234.31471-1-pasha.tatas...@soleen.com
v5: 
https://lore.kernel.org/lkml/20190923203427.294286-1-pasha.tatas...@soleen.com
v4: 
https://lore.kernel.org/lkml/20190909181221.309510-1-pasha.tatas...@soleen.com
v3: 
https://lore.kernel.org/lkml/20190821183204.23576-1-pasha.tatas...@soleen.com
v2: 
https://lore.kernel.org/lkml/20190817024629.26611-1-pasha.tatas...@soleen.com
v1: 
https://lore.kernel.org/lkml/20190801152439.11363-1-pasha.tatas...@soleen.com

James Morse (4):
  arm64: hyp-stub: Check the size of the HYP stub's vectors
  arm64: hyp-stub: Move invalid vector entries into the vectors
  arm64: hyp-stub: Move el1_sync into the vectors
  arm64: kexec: Use dcache ops macros instead of open-coding

Pavel Tatashin (13):
  arm64: kernel: add helper for booted at EL2 and not VHE
  arm64: trans_pgd: hibernate: Add trans_pgd_copy_el2_vectors
  arm64: hibernate: abstract ttbr0 setup function
  arm64: kexec: flush image and lists during kexec load time
  arm64: kexec: skip relocation code for inplace kexec
  arm64: kexec: pass kimage as the only argument to relocation function
  arm64: kexec: kexec may require EL2 vectors
  arm64: kexec: relocate in EL1 mode
  arm64: kexec: use ld script for relocation function
  arm64: kexec: install a copy of the linear-map
  arm64: kexec: keep MMU enabled during kexec relocation
  arm64: kexec: remove the pre-kexec PoC maintenance
  arm64: kexec: Remove cpu-reset.h

 arch/arm64/Kconfig   |   2 +-
 arch/arm64/include/asm/assembler.h   |  31 -
 arch/arm64/include/asm/kexec.h   |  12 ++
 arch/arm64/include/asm/mmu_context.h |  24 
 arch/arm64/include/asm/sections.h|   1 +
 arch/arm64/include/asm/trans_pgd.h   |   3 +
 arch/arm64/include/asm/virt.h|   8 ++
 arch/arm64/kernel/asm-offsets.c  |  11 ++
 arch/arm64/kernel/cpu-reset.S|   7 +-
 arch/arm64/kernel/cpu-reset.h|  32 -
 arch/arm64/kernel/hibernate-asm.S|  20 
 arch/arm64/kernel/hibernate.c|  56 +++--
 arch/arm64/kernel/hyp-stub.S |  95 +++
 arch/arm64/kernel/machine_kexec.c| 168 +++
 arch/arm64/kernel/relocate_kernel.S  |  72 ++--
 arch/arm64/kernel/sdei.c |   2 +-
 arch/arm64/kernel/vmlinux.lds.S  |  19 +++
 arch/arm64/mm/trans_pgd.c|  20 
 18 files changed, 314 insertions(+), 269 deletions(-)
 delete mode 100644 arch/arm64/kernel/cpu-reset.h

-- 
2.25.1



[PATCH v12 01/17] arm64: hyp-stub: Check the size of the HYP stub's vectors

2021-03-03 Thread Pavel Tatashin
From: James Morse 

Hibernate contains a set of temporary EL2 vectors used to 'park'
EL2 somewhere safe while all the memory is thrown in the air.
Making kexec do its relocations with the MMU on means they have to
be done at EL1, so EL2 has to be parked. This means yet another
set of vectors.

All these things do is HVC_SET_VECTORS and HVC_SOFT_RESTART, both
of which are implemented by the hyp-stub. Let's copy it instead
of re-inventing it.

To do this the hyp-stub's entrails need to be packed neatly inside
its 2K vectors.

Start by moving the final 2K alignment inside the end marker, and
add a build check that we didn't overflow 2K.

Signed-off-by: James Morse 
Signed-off-by: Pavel Tatashin 
---
 arch/arm64/kernel/hyp-stub.S | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/hyp-stub.S b/arch/arm64/kernel/hyp-stub.S
index 5eccbd62fec8..572b28646005 100644
--- a/arch/arm64/kernel/hyp-stub.S
+++ b/arch/arm64/kernel/hyp-stub.S
@@ -41,9 +41,13 @@ SYM_CODE_START(__hyp_stub_vectors)
ventry  el1_irq_invalid // IRQ 32-bit EL1
ventry  el1_fiq_invalid // FIQ 32-bit EL1
ventry  el1_error_invalid   // Error 32-bit EL1
+   .align 11
+SYM_INNER_LABEL(__hyp_stub_vectors_end, SYM_L_LOCAL)
 SYM_CODE_END(__hyp_stub_vectors)
 
-   .align 11
+# Check the __hyp_stub_vectors didn't overflow
+.org . - (__hyp_stub_vectors_end - __hyp_stub_vectors) + SZ_2K
+
 
 SYM_CODE_START_LOCAL(el1_sync)
cmp x0, #HVC_SET_VECTORS
-- 
2.25.1



Re: [PATCH v3 1/1] arm64: mm: correct the inside linear map range during hotplug check

2021-02-22 Thread Pavel Tatashin
> Taking that won't help either though, because it will just explode when
> it meets 'mm' in Linus's tree.
>
> So here's what I think we need to do:
>
>   - I'll apply your v3 at -rc1
>   - You can send backports based on your -v2 for stable once the v3 has
> been merged upstream.
>
> Sound good?

Sounds good, I will send the backport once v3 lands in Linus's tree.

Thanks,
Pasha

>
> Will


Re: [PATCH 1/1] kexec: move machine_kexec_post_load() to public interface

2021-02-19 Thread Pavel Tatashin
On Fri, Feb 19, 2021 at 2:14 PM Will Deacon  wrote:
>
> On Fri, Feb 19, 2021 at 02:06:31PM -0500, Pavel Tatashin wrote:
> > On Fri, Feb 19, 2021 at 12:53 PM Will Deacon  wrote:
> > >
> > > On Mon, Feb 15, 2021 at 01:59:08PM -0500, Pavel Tatashin wrote:
> > > > machine_kexec_post_load() is called after kexec load is finished. It 
> > > > must
> > > > be declared in public header not in kexec_internal.h
> > >
> > > Could you provide a log of what goes wrong without this patch, please?
> > >
> > > > Reported-by: kernel test robot 
> > >
> > > Do you have a link to the report, or did it not go to the list?
> >
> > Hi Will,
> >
> > https://lore.kernel.org/linux-arm-kernel/202102030727.gqtokach-...@intel.com/
> >
> > It is also linked in the cover letter.
>
> Ah, great. Please add that as a Link: tag in the patch, and in-line the
> compiler warning.

Version 2 of this fix:
https://lore.kernel.org/lkml/20210219195142.13571-1-pasha.tatas...@soleen.com/

With the tag link, and warning in-lined.

Thank you,
Pasha


[PATCH v2] kexec: move machine_kexec_post_load() to public interface

2021-02-19 Thread Pavel Tatashin
machine_kexec_post_load() is called after kexec load is finished. It must
be declared in a public header, not in kexec_internal.h.

Fixes the following compiler warning:

arch/arm64/kernel/machine_kexec.c:62:5: warning: no previous prototype for
function 'machine_kexec_post_load' [-Wmissing-prototypes]
   int machine_kexec_post_load(struct kimage *kimage)

Reported-by: kernel test robot 
Link: 
https://lore.kernel.org/linux-arm-kernel/202102030727.gqtokach-...@intel.com
Signed-off-by: Pavel Tatashin 
---
 include/linux/kexec.h   | 2 ++
 kernel/kexec_internal.h | 2 --
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 9e93bef52968..3671b845cf28 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -309,6 +309,8 @@ extern void machine_kexec_cleanup(struct kimage *image);
 extern int kernel_kexec(void);
 extern struct page *kimage_alloc_control_pages(struct kimage *image,
unsigned int order);
+int machine_kexec_post_load(struct kimage *image);
+
 extern void __crash_kexec(struct pt_regs *);
 extern void crash_kexec(struct pt_regs *);
 int kexec_should_crash(struct task_struct *);
diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
index 39d30ccf8d87..48aaf2ac0d0d 100644
--- a/kernel/kexec_internal.h
+++ b/kernel/kexec_internal.h
@@ -13,8 +13,6 @@ void kimage_terminate(struct kimage *image);
 int kimage_is_destination_range(struct kimage *image,
unsigned long start, unsigned long end);
 
-int machine_kexec_post_load(struct kimage *image);
-
 extern struct mutex kexec_mutex;
 
 #ifdef CONFIG_KEXEC_FILE
-- 
2.25.1



Re: [PATCH v3 1/1] arm64: mm: correct the inside linear map range during hotplug check

2021-02-19 Thread Pavel Tatashin
On Fri, Feb 19, 2021 at 2:18 PM Will Deacon  wrote:
>
> On Tue, Feb 16, 2021 at 10:03:51AM -0500, Pavel Tatashin wrote:
> > Memory hotplug may fail on systems with CONFIG_RANDOMIZE_BASE because the
> > linear map range is not checked correctly.
> >
> > The start physical address that linear map covers can be actually at the
> > end of the range because of randomization. Check that and if so reduce it
> > to 0.
> >
> > This can be verified on QEMU with setting kaslr-seed to ~0ul:
> >
> > memstart_offset_seed = 0x
> > START: __pa(_PAGE_OFFSET(vabits_actual)) = 9000c000
> > END:   __pa(PAGE_END - 1) =  1000bfff
> >
> > Signed-off-by: Pavel Tatashin 
> > Fixes: 58284a901b42 ("arm64/mm: Validate hotplug range before creating 
> > linear mapping")
> > Tested-by: Tyler Hicks 
> > ---
> >  arch/arm64/mm/mmu.c | 21 +++--
> >  1 file changed, 19 insertions(+), 2 deletions(-)
>
> I tried to queue this as a fix, but unfortunately it doesn't apply.
> Please can you send a v4 based on the arm64 for-next/fixes branch?

Hi Will,

The previous version, which is not built against linux-next, still
applies against the current mainline/for-next/fixes

https://lore.kernel.org/lkml/20210215192237.362706-2-pasha.tatas...@soleen.com/

I just tried it. I think it would make sense to take the v2 fix, so it
could also be backported to stable.

Thank you,
Pasha


Re: [PATCH 1/1] kexec: move machine_kexec_post_load() to public interface

2021-02-19 Thread Pavel Tatashin
On Fri, Feb 19, 2021 at 12:53 PM Will Deacon  wrote:
>
> On Mon, Feb 15, 2021 at 01:59:08PM -0500, Pavel Tatashin wrote:
> > machine_kexec_post_load() is called after kexec load is finished. It must
> > be declared in public header not in kexec_internal.h
>
> Could you provide a log of what goes wrong without this patch, please?
>
> > Reported-by: kernel test robot 
>
> Do you have a link to the report, or did it not go to the list?

Hi Will,

https://lore.kernel.org/linux-arm-kernel/202102030727.gqtokach-...@intel.com/

It is also linked in the cover letter.

Thank you,
Pasha


[PATCH v3 0/1] correct the inside linear map range during hotplug check

2021-02-16 Thread Pavel Tatashin
v3: - Sync with linux-next where arch_get_mappable_range() was
  introduced.
v2: - Added Tested-by from Tyler Hicks
- Addressed comments from Anshuman Khandual: moved check under
  IS_ENABLED(CONFIG_RANDOMIZE_BASE), added 
  WARN_ON(start_linear_pa > end_linear_pa);

Fixes a hotplug error that may occur on systems with CONFIG_RANDOMIZE_BASE
enabled.

Applies against linux-next.

v1:
https://lore.kernel.org/lkml/20210213012316.1525419-1-pasha.tatas...@soleen.com
v2:
https://lore.kernel.org/lkml/20210215192237.362706-1-pasha.tatas...@soleen.com

Pavel Tatashin (1):
  arm64: mm: correct the inside linear map range during hotplug check

 arch/arm64/mm/mmu.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

-- 
2.25.1



[PATCH v3 1/1] arm64: mm: correct the inside linear map range during hotplug check

2021-02-16 Thread Pavel Tatashin
Memory hotplug may fail on systems with CONFIG_RANDOMIZE_BASE because the
linear map range is not checked correctly.

The start physical address that the linear map covers can actually be at
the end of the range because of randomization. Check for that, and if so,
reduce it to 0.

This can be verified on QEMU by setting kaslr-seed to ~0ul:

memstart_offset_seed = 0x
START: __pa(_PAGE_OFFSET(vabits_actual)) = 9000c000
END:   __pa(PAGE_END - 1) =  1000bfff

Signed-off-by: Pavel Tatashin 
Fixes: 58284a901b42 ("arm64/mm: Validate hotplug range before creating linear 
mapping")
Tested-by: Tyler Hicks 
---
 arch/arm64/mm/mmu.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index ef7698c4e2f0..0d9c115e427f 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1447,6 +1447,22 @@ static void __remove_pgd_mapping(pgd_t *pgdir, unsigned 
long start, u64 size)
 struct range arch_get_mappable_range(void)
 {
struct range mhp_range;
+   u64 start_linear_pa = __pa(_PAGE_OFFSET(vabits_actual));
+   u64 end_linear_pa = __pa(PAGE_END - 1);
+
+   if (IS_ENABLED(CONFIG_RANDOMIZE_BASE)) {
+   /*
+* Check for a wrap, it is possible because of randomized linear
+* mapping the start physical address is actually bigger than
+* the end physical address. In this case set start to zero
+* because [0, end_linear_pa] range must still be able to cover
+* all addressable physical addresses.
+*/
+   if (start_linear_pa > end_linear_pa)
+   start_linear_pa = 0;
+   }
+
+   WARN_ON(start_linear_pa > end_linear_pa);
 
/*
 * Linear mapping region is the range [PAGE_OFFSET..(PAGE_END - 1)]
@@ -1454,8 +1470,9 @@ struct range arch_get_mappable_range(void)
 * range which can be mapped inside this linear mapping range, must
 * also be derived from its end points.
 */
-   mhp_range.start = __pa(_PAGE_OFFSET(vabits_actual));
-   mhp_range.end =  __pa(PAGE_END - 1);
+   mhp_range.start = start_linear_pa;
+   mhp_range.end =  end_linear_pa;
+
return mhp_range;
 }
 
-- 
2.25.1



Re: [PATCH v2 1/1] arm64: mm: correct the inside linear map boundaries during hotplug check

2021-02-16 Thread Pavel Tatashin
> There is a new generic framework which expects the platform to provide two
> distinct range points (low and high) for hotplug address comparison. Those
> range points can be different depending on whether address randomization
> is enabled and the flip occurs. But this comparison here in the platform
> code is going away.
>
> This patch needs to rebased on the new framework which is part of linux-next.
>
> https://patchwork.kernel.org/project/linux-mm/list/?series=425051

Hi Anshuman,

Thanks for letting me know. I will send an updated patch against linux-next.

Thank you,
Pasha


Re: [PATCH v2 1/1] arm64: mm: correct the inside linear map boundaries during hotplug check

2021-02-16 Thread Pavel Tatashin
On Tue, Feb 16, 2021 at 2:36 AM Ard Biesheuvel  wrote:
>
> On Tue, 16 Feb 2021 at 04:12, Anshuman Khandual
>  wrote:
> >
> >
> >
> > On 2/16/21 1:21 AM, Pavel Tatashin wrote:
> > > On Mon, Feb 15, 2021 at 2:34 PM Ard Biesheuvel  wrote:
> > >>
> > >> On Mon, 15 Feb 2021 at 20:30, Pavel Tatashin  
> > >> wrote:
> > >>>
> > >>>> Can't we simply use signed arithmetic here? This expression works fine
> > >>>> if the quantities are all interpreted as s64 instead of u64
> > >>>
> > >>> I was thinking about that, but I do not like the idea of using sign
> > >>> arithmetics for physical addresses. Also, I am worried that someone in
> > >>> the future will unknowingly change it to unsigns or to phys_addr_t. It
> > >>> is safer to have start explicitly set to 0 in case of wrap.
> > >>
> > >> memstart_addr is already a s64 for this exact reason.
> > >
> > > memstart_addr is basically an offset and it must be negative. For
> > > example, this would not work if it was not signed:
> > > #define vmemmap ((struct page *)VMEMMAP_START - (memstart_addr >> 
> > > PAGE_SHIFT))
> > >
> > > However, on powerpc it is phys_addr_t type.
> > >
> > >>
> > >> Btw, the KASLR check is incorrect: memstart_addr could also be
> > >> negative when running the 52-bit VA kernel on hardware that is only
> > >> 48-bit VA capable.
> > >
> > > Good point!
> > >
> > > if (IS_ENABLED(CONFIG_ARM64_VA_BITS_52) && (vabits_actual != 52))
> > > memstart_addr -= _PAGE_OFFSET(48) - _PAGE_OFFSET(52);
> > >
> > > So, I will remove IS_ENABLED(CONFIG_RANDOMIZE_BASE) again.
> > >
> > > I am OK to change start_linear_pa, end_linear_pa to signed, but IMO
> > > what I have now is actually safer to make sure that does not break
> > > again in the future.
> > An explicit check for the flip over and providing two different start
> > addresses points would be required in order to use the new framework.
>
> I don't think so. We no longer randomize over the same range, but take
> the support PA range into account. (97d6786e0669d)
>
> This should ensure that __pa(_PAGE_OFFSET(vabits_actual)) never
> assumes a negative value. And to Pavel's point re 48/52 bit VAs: the
> fact that vabits_actual appears in this expression means that it
> already takes this into account, so you are correct that we don't have
> to care about that here.
>
> So even if memstart_addr could be negative, this expression should
> never produce a negative value. And with the patch above applied, it
> should never do so when running under KASLR either.
>
> So question to Pavel and Tyler: could you please check whether you
> have that patch, and whether it fixes the issue? It was introduced in
> v5.11, and hasn't been backported yet (it wasn't marked for -stable)

97d6786e0669d
arm64: mm: account for hotplug memory when randomizing the linear region

That commit does not address the problem described in this bug. It only
addresses the problem of adding extra PA space to the linear map, which
is indeed needed (by the way, is it possible that hotplug is going to add
memory below memblock_start_of_DRAM()? that is not currently accounted
for), but not the fact that the linear map can start from a high address
because of randomization. I have verified it in QEMU, and Tyler verified
it on real hardware after backporting to 5.10: the problem that this
patch fixes is still there.

Pasha


Re: [PATCH v2 1/1] arm64: mm: correct the inside linear map boundaries during hotplug check

2021-02-15 Thread Pavel Tatashin
> >
> > Btw, the KASLR check is incorrect: memstart_addr could also be
> > negative when running the 52-bit VA kernel on hardware that is only
> > 48-bit VA capable.
>
> Good point!
>
> if (IS_ENABLED(CONFIG_ARM64_VA_BITS_52) && (vabits_actual != 52))
> memstart_addr -= _PAGE_OFFSET(48) - _PAGE_OFFSET(52);
>
> So, I will remove IS_ENABLED(CONFIG_RANDOMIZE_BASE) again.

Hi Ard,

Actually, looking more at this, I do not see how, with 52-bit VA on a
48-bit VA processor, the start offset can become negative unless
randomization is involved.
The start of the linear map will point to the first physical address
that is reported by memblock_start_of_DRAM(). However, memstart_addr
will be negative. So, I think the current approach of using
IS_ENABLED(CONFIG_RANDOMIZE_BASE) is good.

48VA processor with VA_BITS_48:
memstart_addr 4000
start_linear_pa 4000
end_linear_pa 80003fff

48VA processor with VA_BITS_52:
memstart_addr fff14000   <- Negative
start_linear_pa 4000  <- positive, and the first PA address
end_linear_pa 80003fff
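
As a rough model of why a negative memstart_addr still yields a positive
start_linear_pa, the sketch below treats __pa() of a linear address simply
as va - PAGE_OFFSET + memstart_addr and reuses the VA_BITS_52 adjustment
quoted earlier. It is an illustration only (standalone C, invented DRAM
start), not the exact arm64 macros:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t u64;
typedef int64_t s64;

int main(void)
{
	u64 page_offset_52 = 0xfff0000000000000ULL;	/* _PAGE_OFFSET(52) */
	u64 page_offset_48 = 0xffff000000000000ULL;	/* _PAGE_OFFSET(48) */
	u64 dram_start     = 0x0000000040000000ULL;	/* example first PA  */

	/* memstart_addr starts at the DRAM base and is then adjusted down
	 * when the 52-bit VA kernel runs on 48-bit VA hardware, which is
	 * what makes it negative. */
	s64 memstart_addr = (s64)dram_start -
			    (s64)(page_offset_48 - page_offset_52);

	/* simplified __pa() of the first usable linear address */
	u64 start_linear_pa = page_offset_48 - page_offset_52 +
			      (u64)memstart_addr;

	printf("memstart_addr   = %llx (%lld as s64)\n",
	       (unsigned long long)memstart_addr, (long long)memstart_addr);
	printf("start_linear_pa = %llx\n", (unsigned long long)start_linear_pa);
	return 0;
}

With these example values memstart_addr comes out as fff1000040000000
(negative as s64) while start_linear_pa is 40000000, i.e. the same shape
as the numbers above.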

Thank you,
Pasha


Re: [PATCH v2 1/1] arm64: mm: correct the inside linear map boundaries during hotplug check

2021-02-15 Thread Pavel Tatashin
On Mon, Feb 15, 2021 at 2:34 PM Ard Biesheuvel  wrote:
>
> On Mon, 15 Feb 2021 at 20:30, Pavel Tatashin  
> wrote:
> >
> > > Can't we simply use signed arithmetic here? This expression works fine
> > > if the quantities are all interpreted as s64 instead of u64
> >
> > I was thinking about that, but I do not like the idea of using sign
> > arithmetics for physical addresses. Also, I am worried that someone in
> > the future will unknowingly change it to unsigns or to phys_addr_t. It
> > is safer to have start explicitly set to 0 in case of wrap.
>
> memstart_addr is already a s64 for this exact reason.

memstart_addr is basically an offset, and it must be able to go negative.
For example, this would not work if it were not signed:
#define vmemmap ((struct page *)VMEMMAP_START - (memstart_addr >> PAGE_SHIFT))

However, on powerpc it is phys_addr_t type.

>
> Btw, the KASLR check is incorrect: memstart_addr could also be
> negative when running the 52-bit VA kernel on hardware that is only
> 48-bit VA capable.

Good point!

if (IS_ENABLED(CONFIG_ARM64_VA_BITS_52) && (vabits_actual != 52))
memstart_addr -= _PAGE_OFFSET(48) - _PAGE_OFFSET(52);

So, I will remove IS_ENABLED(CONFIG_RANDOMIZE_BASE) again.

I am OK to change start_linear_pa, end_linear_pa to signed, but IMO
what I have now is actually safer to make sure that does not break
again in the future.


Re: [PATCH v2 1/1] arm64: mm: correct the inside linear map boundaries during hotplug check

2021-02-15 Thread Pavel Tatashin
> Can't we simply use signed arithmetic here? This expression works fine
> if the quantities are all interpreted as s64 instead of u64

I was thinking about that, but I do not like the idea of using signed
arithmetic for physical addresses. Also, I am worried that someone in
the future will unknowingly change it to unsigned or to phys_addr_t. It
is safer to have start explicitly set to 0 in case of a wrap.


[PATCH v2 1/1] arm64: mm: correct the inside linear map boundaries during hotplug check

2021-02-15 Thread Pavel Tatashin
Memory hotplug may fail on systems with CONFIG_RANDOMIZE_BASE because the
linear map range is not checked correctly.

The start physical address that the linear map covers can actually be at
the end of the range because of randomization. Check for that, and if so,
reduce it to 0.

This can be verified on QEMU with setting kaslr-seed to ~0ul:

memstart_offset_seed = 0x
START: __pa(_PAGE_OFFSET(vabits_actual)) = 9000c000
END:   __pa(PAGE_END - 1) =  1000bfff
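
(The two values above were obtained with a temporary debug print along
these lines; this is only a sketch used for verification, not part of the
patch:)

	pr_info("START: __pa(_PAGE_OFFSET(vabits_actual)) = %llx\n",
		(unsigned long long)__pa(_PAGE_OFFSET(vabits_actual)));
	pr_info("END:   __pa(PAGE_END - 1) = %llx\n",
		(unsigned long long)__pa(PAGE_END - 1));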

Signed-off-by: Pavel Tatashin 
Fixes: 58284a901b42 ("arm64/mm: Validate hotplug range before creating linear 
mapping")
Tested-by: Tyler Hicks 
---
 arch/arm64/mm/mmu.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index ae0c3d023824..cc16443ea67f 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1444,14 +1444,30 @@ static void __remove_pgd_mapping(pgd_t *pgdir, unsigned 
long start, u64 size)
 
 static bool inside_linear_region(u64 start, u64 size)
 {
+   u64 start_linear_pa = __pa(_PAGE_OFFSET(vabits_actual));
+   u64 end_linear_pa = __pa(PAGE_END - 1);
+
+   if (IS_ENABLED(CONFIG_RANDOMIZE_BASE)) {
+   /*
+* Check for a wrap, it is possible because of randomized linear
+* mapping the start physical address is actually bigger than
+* the end physical address. In this case set start to zero
+* because [0, end_linear_pa] range must still be able to cover
+* all addressable physical addresses.
+*/
+   if (start_linear_pa > end_linear_pa)
+   start_linear_pa = 0;
+   }
+
+   WARN_ON(start_linear_pa > end_linear_pa);
+
/*
 * Linear mapping region is the range [PAGE_OFFSET..(PAGE_END - 1)]
 * accommodating both its ends but excluding PAGE_END. Max physical
 * range which can be mapped inside this linear mapping range, must
 * also be derived from its end points.
 */
-   return start >= __pa(_PAGE_OFFSET(vabits_actual)) &&
-  (start + size - 1) <= __pa(PAGE_END - 1);
+   return start >= start_linear_pa && (start + size - 1) <= end_linear_pa;
 }
 
 int arch_add_memory(int nid, u64 start, u64 size,
-- 
2.25.1



[PATCH v2 0/1] correct the inside linear map boundaries during hotplug check

2021-02-15 Thread Pavel Tatashin
v2:
- Added Tested-by from Tyler Hicks
- Addressed comments from Anshuman Khandual: moved check under
  IS_ENABLED(CONFIG_RANDOMIZE_BASE), added 
  WARN_ON(start_linear_pa > end_linear_pa);

Fixes a hotplug error that may occur on systems with CONFIG_RANDOMIZE_BASE
enabled.

v1:
https://lore.kernel.org/lkml/20210213012316.1525419-1-pasha.tatas...@soleen.com

Pavel Tatashin (1):
  arm64: mm: correct the inside linear map boundaries during hotplug
check

 arch/arm64/mm/mmu.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

-- 
2.25.1



[PATCH 0/1] fix machine_kexec_post_load prototype.

2021-02-15 Thread Pavel Tatashin
This is against for-next/kexec, fix for machine_kexec_post_load
warning.

Reported by kernel test robot [1].

[1] https://lore.kernel.org/linux-arm-kernel/202102030727.gqtokach-...@intel.com

Pavel Tatashin (1):
  kexec: move machine_kexec_post_load() to public interface

 include/linux/kexec.h   | 2 ++
 kernel/kexec_internal.h | 2 --
 2 files changed, 2 insertions(+), 2 deletions(-)

-- 
2.25.1



[PATCH 1/1] kexec: move machine_kexec_post_load() to public interface

2021-02-15 Thread Pavel Tatashin
machine_kexec_post_load() is called after kexec load is finished. It must
be declared in public header not in kexec_internal.h

Reported-by: kernel test robot 
Signed-off-by: Pavel Tatashin 
---
 include/linux/kexec.h   | 2 ++
 kernel/kexec_internal.h | 2 --
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 9e93bef52968..3671b845cf28 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -309,6 +309,8 @@ extern void machine_kexec_cleanup(struct kimage *image);
 extern int kernel_kexec(void);
 extern struct page *kimage_alloc_control_pages(struct kimage *image,
unsigned int order);
+int machine_kexec_post_load(struct kimage *image);
+
 extern void __crash_kexec(struct pt_regs *);
 extern void crash_kexec(struct pt_regs *);
 int kexec_should_crash(struct task_struct *);
diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
index 39d30ccf8d87..48aaf2ac0d0d 100644
--- a/kernel/kexec_internal.h
+++ b/kernel/kexec_internal.h
@@ -13,8 +13,6 @@ void kimage_terminate(struct kimage *image);
 int kimage_is_destination_range(struct kimage *image,
unsigned long start, unsigned long end);
 
-int machine_kexec_post_load(struct kimage *image);
-
 extern struct mutex kexec_mutex;
 
 #ifdef CONFIG_KEXEC_FILE
-- 
2.25.1



[PATCH v11 10/14] memory-hotplug.rst: add a note about ZONE_MOVABLE and page pinning

2021-02-15 Thread Pavel Tatashin
Document the special handling of page pinning when ZONE_MOVABLE present.

Signed-off-by: Pavel Tatashin 
Suggested-by: David Hildenbrand 
Acked-by: Michal Hocko 
---
 Documentation/admin-guide/mm/memory-hotplug.rst | 9 +
 1 file changed, 9 insertions(+)

diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst 
b/Documentation/admin-guide/mm/memory-hotplug.rst
index 5c4432c96c4b..c6618f99f765 100644
--- a/Documentation/admin-guide/mm/memory-hotplug.rst
+++ b/Documentation/admin-guide/mm/memory-hotplug.rst
@@ -357,6 +357,15 @@ creates ZONE_MOVABLE as following.
Unfortunately, there is no information to show which memory block belongs
to ZONE_MOVABLE. This is TBD.
 
+.. note::
+   Techniques that rely on long-term pinnings of memory (especially, RDMA and
+   vfio) are fundamentally problematic with ZONE_MOVABLE and, therefore, memory
+   hot remove. Pinned pages cannot reside on ZONE_MOVABLE, to guarantee that
+   memory can still get hot removed - be aware that pinning can fail even if
+   there is plenty of free memory in ZONE_MOVABLE. In addition, using
+   ZONE_MOVABLE might make page pinning more expensive, because pages have to 
be
+   migrated off that zone first.
+
 .. _memory_hotplug_how_to_offline_memory:
 
 How to offline memory
-- 
2.25.1



[PATCH v11 08/14] mm/gup: do not migrate zero page

2021-02-15 Thread Pavel Tatashin
On some platforms ZERO_PAGE(0) might end up in a movable zone. Do not
migrate the zero page in gup during longterm pinning, as migration of the
zero page is not allowed.

For example, in x86 QEMU with 16G of memory and kernelcore=5G parameter, I
see the following:

Boot#1: zero_pfn  0x48a8d zero_pfn zone: ZONE_DMA32
Boot#2: zero_pfn 0x20168d zero_pfn zone: ZONE_MOVABLE

On x86, empty_zero_page is declared in .bss and depending on the loader
may end up in different physical locations during boots.

Also, move the is_zero_pfn() and my_zero_pfn() functions under CONFIG_MMU,
because the zero_pfn that they use is declared in memory.c, which is
compiled with CONFIG_MMU.

Signed-off-by: Pavel Tatashin 
---
 include/linux/mm.h  |  3 ++-
 include/linux/mmzone.h  |  4 
 include/linux/pgtable.h | 12 
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7f56d8d62148..3c75df55ed00 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1496,7 +1496,8 @@ static inline unsigned long page_to_section(const struct 
page *page)
 #ifdef CONFIG_MIGRATION
 static inline bool is_pinnable_page(struct page *page)
 {
-   return !is_zone_movable_page(page) && !is_migrate_cma_page(page);
+   return !(is_zone_movable_page(page) || is_migrate_cma_page(page)) ||
+   is_zero_pfn(page_to_pfn(page));
 }
 #else
 static inline bool is_pinnable_page(struct page *page)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b593316bff3d..c56f508be031 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -407,6 +407,10 @@ enum zone_type {
 *techniques might use alloc_contig_range() to hide previously
 *exposed pages from the buddy again (e.g., to implement some sort
 *of memory unplug in virtio-mem).
+* 6. ZERO_PAGE(0), kernelcore/movablecore setups might create
+*situations where ZERO_PAGE(0) which is allocated differently
+*on different platforms may end up in a movable zone. ZERO_PAGE(0)
+*cannot be migrated.
 *
 * In general, no unmovable allocations that degrade memory offlining
 * should end up in ZONE_MOVABLE. Allocators (like alloc_contig_range())
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8fcdfa52eb4b..7c6cba3d80f0 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1115,6 +1115,7 @@ extern void untrack_pfn(struct vm_area_struct *vma, 
unsigned long pfn,
 extern void untrack_pfn_moved(struct vm_area_struct *vma);
 #endif
 
+#ifdef CONFIG_MMU
 #ifdef __HAVE_COLOR_ZERO_PAGE
 static inline int is_zero_pfn(unsigned long pfn)
 {
@@ -1138,6 +1139,17 @@ static inline unsigned long my_zero_pfn(unsigned long 
addr)
return zero_pfn;
 }
 #endif
+#else
+static inline int is_zero_pfn(unsigned long pfn)
+{
+   return 0;
+}
+
+static inline unsigned long my_zero_pfn(unsigned long addr)
+{
+   return 0;
+}
+#endif /* CONFIG_MMU */
 
 #ifdef CONFIG_MMU
 
-- 
2.25.1



[PATCH v11 09/14] mm/gup: migrate pinned pages out of movable zone

2021-02-15 Thread Pavel Tatashin
We should not pin pages in ZONE_MOVABLE. Currently, we only avoid pinning
movable CMA pages. Generalize the function that migrates CMA pages to
migrate all movable pages. Use is_pinnable_page() to check which
pages need to be migrated.

Signed-off-by: Pavel Tatashin 
Reviewed-by: John Hubbard 
---
 include/linux/migrate.h|  1 +
 include/linux/mmzone.h |  9 -
 include/trace/events/migrate.h |  3 +-
 mm/gup.c   | 67 +-
 4 files changed, 44 insertions(+), 36 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 4594838a0f7c..aae5ef0b3ba1 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -27,6 +27,7 @@ enum migrate_reason {
MR_MEMPOLICY_MBIND,
MR_NUMA_MISPLACED,
MR_CONTIG_RANGE,
+   MR_LONGTERM_PIN,
MR_TYPES
 };
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c56f508be031..e8ccd4eab75e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -387,8 +387,13 @@ enum zone_type {
 * to increase the number of THP/huge pages. Notable special cases are:
 *
 * 1. Pinned pages: (long-term) pinning of movable pages might
-*essentially turn such pages unmovable. Memory offlining might
-*retry a long time.
+*essentially turn such pages unmovable. Therefore, we do not allow
+*pinning long-term pages in ZONE_MOVABLE. When pages are pinned and
+*faulted, they come from the right zone right away. However, it is
+*still possible that address space already has pages in
+*ZONE_MOVABLE at the time when pages are pinned (i.e. user has
+*touched that memory before pinning). In such case we migrate them
+*to a different zone. When migration fails - pinning fails.
 * 2. memblock allocations: kernelcore/movablecore setups might create
 *situations where ZONE_MOVABLE contains unmovable allocations
 *after boot. Memory offlining and allocations fail early.
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 4d434398d64d..363b54ce104c 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -20,7 +20,8 @@
EM( MR_SYSCALL, "syscall_or_cpuset")\
EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind")  \
EM( MR_NUMA_MISPLACED,  "numa_misplaced")   \
-   EMe(MR_CONTIG_RANGE,"contig_range")
+   EM( MR_CONTIG_RANGE,"contig_range") \
+   EMe(MR_LONGTERM_PIN,"longterm_pin")
 
 /*
  * First define the enums in the above macros to be exported to userspace
diff --git a/mm/gup.c b/mm/gup.c
index 9af6faf1b2b3..da6d370fe551 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -88,11 +88,12 @@ static __maybe_unused struct page 
*try_grab_compound_head(struct page *page,
int orig_refs = refs;
 
/*
-* Can't do FOLL_LONGTERM + FOLL_PIN with CMA in the gup fast
-* path, so fail and let the caller fall back to the slow path.
+* Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
+* right zone, so fail and let the caller fall back to the slow
+* path.
 */
-   if (unlikely(flags & FOLL_LONGTERM) &&
-   is_migrate_cma_page(page))
+   if (unlikely((flags & FOLL_LONGTERM) &&
+!is_pinnable_page(page)))
return NULL;
 
/*
@@ -1540,17 +1541,17 @@ struct page *get_dump_page(unsigned long addr)
 }
 #endif /* CONFIG_ELF_CORE */
 
-#ifdef CONFIG_CMA
-static long check_and_migrate_cma_pages(struct mm_struct *mm,
-   unsigned long start,
-   unsigned long nr_pages,
-   struct page **pages,
-   struct vm_area_struct **vmas,
-   unsigned int gup_flags)
+#ifdef CONFIG_MIGRATION
+static long check_and_migrate_movable_pages(struct mm_struct *mm,
+   unsigned long start,
+   unsigned long nr_pages,
+   struct page **pages,
+   struct vm_area_struct **vmas,
+   unsigned int gup_flags)
 {
unsigned long i, isolation_error_count;
bool drain_allow;
-   LIST_HEAD(cma_page_list);
+   LIST_HEAD(movable_page_list);
long ret = nr_pages;
struct page *prev_head, *head;
struct migration_target_control mtc = {
@@ -1568,13 +1569,12 @@ static long

[PATCH v11 11/14] mm/gup: change index type to long as it counts pages

2021-02-15 Thread Pavel Tatashin
In __get_user_pages_locked(), i counts the number of pages, so it should
be long: long is used in all other places that hold a number of pages, and
a 32-bit type is increasingly too small for values proportional to page
counts.

Signed-off-by: Pavel Tatashin 
Acked-by: Michal Hocko 
---
 mm/gup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index da6d370fe551..fab20b934030 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1472,7 +1472,7 @@ static long __get_user_pages_locked(struct mm_struct *mm, 
unsigned long start,
 {
struct vm_area_struct *vma;
unsigned long vm_flags;
-   int i;
+   long i;
 
/* calculate required read or write permissions.
 * If FOLL_FORCE is set, we only require the "MAY" flags.
-- 
2.25.1



[PATCH v11 12/14] mm/gup: longterm pin migration cleanup

2021-02-15 Thread Pavel Tatashin
When pages are longterm pinned, we must migrate them out of the movable
zone. The function that migrates them has a hidden loop with goto. The
loop is there to retry on isolation failures, and after a successful
migration.

Make this code better by moving this loop to the caller.
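
Roughly, the intended shape of the caller after this cleanup is the
following (a sketch of the intent, since the __gup_longterm_locked() hunk
is truncated in this archive; the names match the functions used in this
series):

	/* sketch: __gup_longterm_locked() after this cleanup */
	if (!(gup_flags & FOLL_LONGTERM))
		return __get_user_pages_locked(mm, start, nr_pages, pages,
					       vmas, NULL, gup_flags);

	flags = memalloc_pin_save();
	do {
		rc = __get_user_pages_locked(mm, start, nr_pages, pages,
					     vmas, NULL, gup_flags);
		if (rc <= 0)
			break;
		/* 0 means "pages were migrated or isolation failed, retry" */
		rc = check_and_migrate_movable_pages(rc, pages, gup_flags);
	} while (!rc);
	memalloc_pin_restore(flags);

	return rc;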

Signed-off-by: Pavel Tatashin 
Reviewed-by: Jason Gunthorpe 
---
 mm/gup.c | 93 ++--
 1 file changed, 37 insertions(+), 56 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index fab20b934030..905d550abb91 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1542,27 +1542,28 @@ struct page *get_dump_page(unsigned long addr)
 #endif /* CONFIG_ELF_CORE */
 
 #ifdef CONFIG_MIGRATION
-static long check_and_migrate_movable_pages(struct mm_struct *mm,
-   unsigned long start,
-   unsigned long nr_pages,
+/*
+ * Check whether all pages are pinnable, if so return number of pages.  If some
+ * pages are not pinnable, migrate them, and unpin all pages. Return zero if
+ * pages were migrated, or if some pages were not successfully isolated.
+ * Return negative error if migration fails.
+ */
+static long check_and_migrate_movable_pages(unsigned long nr_pages,
struct page **pages,
-   struct vm_area_struct **vmas,
unsigned int gup_flags)
 {
-   unsigned long i, isolation_error_count;
-   bool drain_allow;
+   unsigned long i;
+   unsigned long isolation_error_count = 0;
+   bool drain_allow = true;
LIST_HEAD(movable_page_list);
-   long ret = nr_pages;
-   struct page *prev_head, *head;
+   long ret = 0;
+   struct page *prev_head = NULL;
+   struct page *head;
struct migration_target_control mtc = {
.nid = NUMA_NO_NODE,
.gfp_mask = GFP_USER | __GFP_NOWARN,
};
 
-check_again:
-   prev_head = NULL;
-   isolation_error_count = 0;
-   drain_allow = true;
for (i = 0; i < nr_pages; i++) {
head = compound_head(pages[i]);
if (head == prev_head)
@@ -1600,47 +1601,27 @@ static long check_and_migrate_movable_pages(struct 
mm_struct *mm,
 * in the correct zone.
 */
if (list_empty(&movable_page_list) && !isolation_error_count)
-   return ret;
+   return nr_pages;
 
+   if (gup_flags & FOLL_PIN) {
+   unpin_user_pages(pages, nr_pages);
+   } else {
+   for (i = 0; i < nr_pages; i++)
+   put_page(pages[i]);
+   }
if (!list_empty(&movable_page_list)) {
-   /*
-* drop the above get_user_pages reference.
-*/
-   if (gup_flags & FOLL_PIN)
-   unpin_user_pages(pages, nr_pages);
-   else
-   for (i = 0; i < nr_pages; i++)
-   put_page(pages[i]);
-
ret = migrate_pages(&movable_page_list, alloc_migration_target,
NULL, (unsigned long)&mtc, MIGRATE_SYNC,
MR_LONGTERM_PIN);
-   if (ret) {
-   if (!list_empty(&movable_page_list))
-   putback_movable_pages(&movable_page_list);
-   return ret > 0 ? -ENOMEM : ret;
-   }
-
-   /* We unpinned pages before migration, pin them again */
-   ret = __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
- NULL, gup_flags);
-   if (ret <= 0)
-   return ret;
-   nr_pages = ret;
+   if (ret && !list_empty(&movable_page_list))
+   putback_movable_pages(&movable_page_list);
}
 
-   /*
-* check again because pages were unpinned, and we also might have
-* had isolation errors and need more pages to migrate.
-*/
-   goto check_again;
+   return ret > 0 ? -ENOMEM : ret;
 }
 #else
-static long check_and_migrate_movable_pages(struct mm_struct *mm,
-   unsigned long start,
-   unsigned long nr_pages,
+static long check_and_migrate_movable_pages(unsigned long nr_pages,
struct page **pages,
-   struct vm_area_struct **vmas,
unsigned int gup_flags)
 {
return nr_pages;
@@ -1658,22 +1639,22 @@ static long __gup_longterm_locked(struct mm_struct *mm,
  struct vm_area_struct **vmas,
  unsigned int gup_flags)
 {
-   unsigned long flags = 0;
+   unsigned int flags;
long rc;
 
-   if (gup_fl

[PATCH v11 13/14] selftests/vm: gup_test: fix test flag

2021-02-15 Thread Pavel Tatashin
In gup_test both gup_flags and test_flags use the same flags field.
This is broken.

Further, in the actual gup_test.c all the passed gup_flags are erased and
unconditionally replaced with FOLL_WRITE.

This means that test_flags are ignored, and code like this always
performs the pin dump test:

155 if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
156 nr = pin_user_pages(addr, nr, gup->flags,
157 pages + i, NULL);
158 else
159 nr = get_user_pages(addr, nr, gup->flags,
160 pages + i, NULL);
161 break;

Add a new test_flags field, to allow raw gup_flags to work.
Add a new subcommand for DUMP_USER_PAGES_TEST to specify that pin test
should be performed.
Remove the unconditional overwriting of gup_flags via FOLL_WRITE, but
preserve the previous behaviour where FOLL_WRITE was the default flag,
and add a new option "-W" to unset FOLL_WRITE.

Rename flags to gup_flags.

With the fix, dump works like this:

root@virtme:/# gup_test  -c
 page #0, starting from user virt addr: 0x7f8acb9e4000
page:d3d2ee27 refcount:2 mapcount:1 mapping:
index:0x0 pfn:0x100bcf
anon flags: 0x3080016(referenced|uptodate|lru|swapbacked)
raw: 03080016 d0e204021608 d0e208df2e88 8ea04243ec61
raw:   0002 
page dumped because: gup_test: dump_pages() test
DUMP_USER_PAGES_TEST: done

root@virtme:/# gup_test  -c -p
 page #0, starting from user virt addr: 0x7fd19701b000
page:baed3c7d refcount:1025 mapcount:1 mapping:
index:0x0 pfn:0x108008
anon flags: 0x3080014(uptodate|lru|swapbacked)
raw: 03080014 d0e204200188 d0e205e09088 8ea04243ee71
raw:   0401 
page dumped because: gup_test: dump_pages() test
DUMP_USER_PAGES_TEST: done

Refcount shows the difference between pin vs no-pin case.
Also change type of nr from int to long, as it counts number of pages.

Signed-off-by: Pavel Tatashin 
Reviewed-by: John Hubbard 
---
 mm/gup_test.c | 23 ++-
 mm/gup_test.h |  3 ++-
 tools/testing/selftests/vm/gup_test.c | 15 +++
 3 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/mm/gup_test.c b/mm/gup_test.c
index e3cf78e5873e..a6ed1c877679 100644
--- a/mm/gup_test.c
+++ b/mm/gup_test.c
@@ -94,7 +94,7 @@ static int __gup_test_ioctl(unsigned int cmd,
 {
ktime_t start_time, end_time;
unsigned long i, nr_pages, addr, next;
-   int nr;
+   long nr;
struct page **pages;
int ret = 0;
bool needs_mmap_lock =
@@ -126,37 +126,34 @@ static int __gup_test_ioctl(unsigned int cmd,
nr = (next - addr) / PAGE_SIZE;
}
 
-   /* Filter out most gup flags: only allow a tiny subset here: */
-   gup->flags &= FOLL_WRITE;
-
switch (cmd) {
case GUP_FAST_BENCHMARK:
-   nr = get_user_pages_fast(addr, nr, gup->flags,
+   nr = get_user_pages_fast(addr, nr, gup->gup_flags,
 pages + i);
break;
case GUP_BASIC_TEST:
-   nr = get_user_pages(addr, nr, gup->flags, pages + i,
+   nr = get_user_pages(addr, nr, gup->gup_flags, pages + i,
NULL);
break;
case PIN_FAST_BENCHMARK:
-   nr = pin_user_pages_fast(addr, nr, gup->flags,
+   nr = pin_user_pages_fast(addr, nr, gup->gup_flags,
 pages + i);
break;
case PIN_BASIC_TEST:
-   nr = pin_user_pages(addr, nr, gup->flags, pages + i,
+   nr = pin_user_pages(addr, nr, gup->gup_flags, pages + i,
NULL);
break;
case PIN_LONGTERM_BENCHMARK:
nr = pin_user_pages(addr, nr,
-   gup->flags | FOLL_LONGTERM,
+   gup->gup_flags | FOLL_LONGTERM,
pages + i, NULL);
break;
case DUMP_USER_PAGES_TEST:
-   if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
-   nr = pin_user_pages(addr, nr, gup->flags,
+   if (gup->test_flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)

[PATCH v11 14/14] selftests/vm: gup_test: test faulting in kernel, and verify pinnable pages

2021-02-15 Thread Pavel Tatashin
When pages are pinned they can be faulted in from userland and migrated,
or they can be faulted in directly in the kernel without migration.

In either case, the pinned pages must end up being pinnable (not movable).

Add a new test to gup_test, to help verify that the gup/pup
(get_user_pages() / pin_user_pages()) behavior with respect to pinnable
and movable pages is reasonable and correct. Specifically, provide a
way to:

1) Verify that only "pinnable" pages are pinned. This is checked
automatically for you.

2) Verify that gup/pup performance is reasonable. This requires
comparing benchmarks between doing gup/pup on pages that have been
pre-faulted in from user space, vs. doing gup/pup on pages that are not
faulted in until gup/pup time (via FOLL_TOUCH). This decision is
controlled with the new -z command line option.

Signed-off-by: Pavel Tatashin 
Reviewed-by: John Hubbard 
---
 mm/gup_test.c |  6 ++
 tools/testing/selftests/vm/gup_test.c | 23 +++
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/mm/gup_test.c b/mm/gup_test.c
index a6ed1c877679..d974dec19e1c 100644
--- a/mm/gup_test.c
+++ b/mm/gup_test.c
@@ -52,6 +52,12 @@ static void verify_dma_pinned(unsigned int cmd, struct page 
**pages,
 
dump_page(page, "gup_test failure");
break;
+   } else if (cmd == PIN_LONGTERM_BENCHMARK &&
+   WARN(!is_pinnable_page(page),
+"pages[%lu] is NOT pinnable but pinned\n",
+i)) {
+   dump_page(page, "gup_test failure");
+   break;
}
}
break;
diff --git a/tools/testing/selftests/vm/gup_test.c 
b/tools/testing/selftests/vm/gup_test.c
index 943cc2608dc2..1e662d59c502 100644
--- a/tools/testing/selftests/vm/gup_test.c
+++ b/tools/testing/selftests/vm/gup_test.c
@@ -13,6 +13,7 @@
 
 /* Just the flags we need, copied from mm.h: */
#define FOLL_WRITE 0x01	/* check pte is writable */
+#define FOLL_TOUCH 0x02	/* mark page accessed */
 
 static char *cmd_to_str(unsigned long cmd)
 {
@@ -39,11 +40,11 @@ int main(int argc, char **argv)
unsigned long size = 128 * MB;
int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1;
unsigned long cmd = GUP_FAST_BENCHMARK;
-   int flags = MAP_PRIVATE;
+   int flags = MAP_PRIVATE, touch = 0;
char *file = "/dev/zero";
char *p;
 
-   while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHp")) != -1) {
+   while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHpz")) != -1) {
switch (opt) {
case 'a':
cmd = PIN_FAST_BENCHMARK;
@@ -110,6 +111,10 @@ int main(int argc, char **argv)
case 'H':
flags |= (MAP_HUGETLB | MAP_ANONYMOUS);
break;
+   case 'z':
+   /* fault pages in gup, do not fault in userland */
+   touch = 1;
+   break;
default:
return -1;
}
@@ -167,8 +172,18 @@ int main(int argc, char **argv)
else if (thp == 0)
madvise(p, size, MADV_NOHUGEPAGE);
 
-   for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
-   p[0] = 0;
+   /*
+* FOLL_TOUCH, in gup_test, is used as an either/or case: either
+* fault pages in from the kernel via FOLL_TOUCH, or fault them
+* in here, from user space. This allows comparison of performance
+* between those two cases.
+*/
+   if (touch) {
+   gup.gup_flags |= FOLL_TOUCH;
+   } else {
+   for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
+   p[0] = 0;
+   }
 
/* Only report timing information on the *_BENCHMARK commands: */
if ((cmd == PIN_FAST_BENCHMARK) || (cmd == GUP_FAST_BENCHMARK) ||
-- 
2.25.1



[PATCH v11 06/14] mm: apply per-task gfp constraints in fast path

2021-02-15 Thread Pavel Tatashin
The function current_gfp_context() is called after the fast path. However,
soon we will add more constraints which will also limit zones based on
context. Move this call into the fast path, and apply the correct
constraints for all allocations.

Also update .reclaim_idx based on the value returned by
current_gfp_context(), because it will soon modify the allowed zones.

Note:
With this patch we will do one extra current->flags load during fast path,
but we already load current->flags in fast-path:

__alloc_pages_nodemask()
 prepare_alloc_pages()
  current_alloc_flags(gfp_mask, *alloc_flags);

Later, when we add the zone constrain logic to current_gfp_context() we
will be able to remove current->flags load from current_alloc_flags, and
therefore return fast-path to the current performance level.

Suggested-by: Michal Hocko 
Signed-off-by: Pavel Tatashin 
Acked-by: Michal Hocko 
---
 mm/page_alloc.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e4b1eda87827..f6058e8f3105 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4981,6 +4981,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int 
order, int preferred_nid,
}
 
gfp_mask &= gfp_allowed_mask;
+   /*
+* Apply scoped allocation constraints. This is mainly about GFP_NOFS
+* resp. GFP_NOIO which has to be inherited for all allocation requests
+* from a particular context which has been marked by
+* memalloc_no{fs,io}_{save,restore}.
+*/
+   gfp_mask = current_gfp_context(gfp_mask);
alloc_mask = gfp_mask;
if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac,
&alloc_mask, &alloc_flags))
return NULL;
@@ -4996,13 +5003,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int 
order, int preferred_nid,
if (likely(page))
goto out;
 
-   /*
-* Apply scoped allocation constraints. This is mainly about GFP_NOFS
-* resp. GFP_NOIO which has to be inherited for all allocation requests
-* from a particular context which has been marked by
-* memalloc_no{fs,io}_{save,restore}.
-*/
-   alloc_mask = current_gfp_context(gfp_mask);
+   alloc_mask = gfp_mask;
ac.spread_dirty_pages = false;
 
/*
-- 
2.25.1



[PATCH v11 07/14] mm: honor PF_MEMALLOC_PIN for all movable pages

2021-02-15 Thread Pavel Tatashin
PF_MEMALLOC_PIN is only honored for CMA pages. Extend
this flag to work for any allocation from ZONE_MOVABLE by removing
__GFP_MOVABLE from gfp_mask when this flag is set in the current
context.

Add is_pinnable_page() to return true if a page is pinnable.
A pinnable page is not in ZONE_MOVABLE and not of MIGRATE_CMA type.

Signed-off-by: Pavel Tatashin 
Acked-by: Michal Hocko 
---
 include/linux/mm.h   | 18 ++
 include/linux/sched/mm.h |  6 +-
 mm/hugetlb.c |  2 +-
 mm/page_alloc.c  | 20 +---
 4 files changed, 33 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..7f56d8d62148 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1116,6 +1116,11 @@ static inline bool is_zone_device_page(const struct page 
*page)
 }
 #endif
 
+static inline bool is_zone_movable_page(const struct page *page)
+{
+   return page_zonenum(page) == ZONE_MOVABLE;
+}
+
 #ifdef CONFIG_DEV_PAGEMAP_OPS
 void free_devmap_managed_page(struct page *page);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
@@ -1487,6 +1492,19 @@ static inline unsigned long page_to_section(const struct 
page *page)
 }
 #endif
 
+/* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */
+#ifdef CONFIG_MIGRATION
+static inline bool is_pinnable_page(struct page *page)
+{
+   return !is_zone_movable_page(page) && !is_migrate_cma_page(page);
+}
+#else
+static inline bool is_pinnable_page(struct page *page)
+{
+   return true;
+}
+#endif
+
 static inline void set_page_zone(struct page *page, enum zone_type zone)
 {
page->flags &= ~(ZONES_MASK << ZONES_PGSHIFT);
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 5f4dd3274734..a55277b0d475 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -150,12 +150,13 @@ static inline bool in_vfork(struct task_struct *tsk)
  * Applies per-task gfp context to the given allocation flags.
  * PF_MEMALLOC_NOIO implies GFP_NOIO
  * PF_MEMALLOC_NOFS implies GFP_NOFS
+ * PF_MEMALLOC_PIN  implies !GFP_MOVABLE
  */
 static inline gfp_t current_gfp_context(gfp_t flags)
 {
unsigned int pflags = READ_ONCE(current->flags);
 
-   if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS))) {
+   if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | 
PF_MEMALLOC_PIN))) {
/*
 * NOIO implies both NOIO and NOFS and it is a weaker context
 * so always make sure it makes precedence
@@ -164,6 +165,9 @@ static inline gfp_t current_gfp_context(gfp_t flags)
flags &= ~(__GFP_IO | __GFP_FS);
else if (pflags & PF_MEMALLOC_NOFS)
flags &= ~__GFP_FS;
+
+   if (pflags & PF_MEMALLOC_PIN)
+   flags &= ~__GFP_MOVABLE;
}
return flags;
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 861de87daf07..d1bcf5ed8df2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1052,7 +1052,7 @@ static struct page *dequeue_huge_page_node_exact(struct 
hstate *h, int nid)
bool pin = !!(current->flags & PF_MEMALLOC_PIN);
 
list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
-   if (pin && is_migrate_cma_page(page))
+   if (pin && !is_pinnable_page(page))
continue;
 
if (PageHWPoison(page))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f6058e8f3105..ed38a3ccb9eb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3807,16 +3807,13 @@ alloc_flags_nofragment(struct zone *zone, gfp_t 
gfp_mask)
return alloc_flags;
 }
 
-static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
-   unsigned int alloc_flags)
+/* Must be called after current_gfp_context() which can change gfp_mask */
+static inline unsigned int gfp_to_alloc_flags_cma(gfp_t gfp_mask,
+ unsigned int alloc_flags)
 {
 #ifdef CONFIG_CMA
-   unsigned int pflags = current->flags;
-
-   if (!(pflags & PF_MEMALLOC_PIN) &&
-   gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+   if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;
-
 #endif
return alloc_flags;
 }
@@ -4472,7 +4469,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
} else if (unlikely(rt_task(current)) && !in_interrupt())
alloc_flags |= ALLOC_HARDER;
 
-   alloc_flags = current_alloc_flags(gfp_mask, alloc_flags);
+   alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, alloc_flags);
 
return alloc_flags;
 }
@@ -4774,7 +4771,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
if (reserve_flags)
-   alloc_flags = current_alloc_flags(g

[PATCH v11 05/14] mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN

2021-02-15 Thread Pavel Tatashin
PF_MEMALLOC_NOCMA is used to guarantee that the allocator will not return
pages that might belong to the CMA region. This is currently used for
long-term gup to make sure that such pins are not going to be done on any
CMA pages.

When PF_MEMALLOC_NOCMA was introduced, we had not realized that it focuses
on CMA pages too much and that there is a larger class of pages that need
the same treatment. The MOVABLE zone cannot contain any long-term pins
either, so it makes sense to reuse and redefine this flag for that use
case as well. Rename the flag to PF_MEMALLOC_PIN, which defines an
allocation context which can only get pages suitable for long-term pins.

Also re-name:
memalloc_nocma_save()/memalloc_nocma_restore()
to
memalloc_pin_save()/memalloc_pin_restore()
and make the new functions common.
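
A minimal usage sketch of the renamed helpers (mirroring how
__gup_longterm_locked() uses them later in this series;
do_the_pinning_work() is just a hypothetical placeholder):

	unsigned int flags;
	long rc;

	flags = memalloc_pin_save();
	/*
	 * Allocations made here are constrained to zones that allow
	 * long-term pinning (the follow-up patches teach the page
	 * allocator about PF_MEMALLOC_PIN).
	 */
	rc = do_the_pinning_work();	/* hypothetical placeholder */
	memalloc_pin_restore(flags);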

Signed-off-by: Pavel Tatashin 
Reviewed-by: John Hubbard 
Acked-by: Michal Hocko 
---
 include/linux/sched.h|  2 +-
 include/linux/sched/mm.h | 21 +
 mm/gup.c |  4 ++--
 mm/hugetlb.c |  4 ++--
 mm/page_alloc.c  |  4 ++--
 5 files changed, 12 insertions(+), 23 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e3a5eeec509..0fbb03bb776e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1568,7 +1568,7 @@ extern struct pid *cad_pid;
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000	/* Early kill for mce process policy */
-#define PF_MEMALLOC_NOCMA	0x10000000	/* All allocation request will have _GFP_MOVABLE cleared */
+#define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to zones which allow long term pinning. */
 #define PF_FREEZER_SKIP		0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK		0x80000000	/* This thread called freeze_processes() and should not be frozen */
 
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 1ae08b8462a4..5f4dd3274734 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -270,29 +270,18 @@ static inline void memalloc_noreclaim_restore(unsigned 
int flags)
current->flags = (current->flags & ~PF_MEMALLOC) | flags;
 }
 
-#ifdef CONFIG_CMA
-static inline unsigned int memalloc_nocma_save(void)
+static inline unsigned int memalloc_pin_save(void)
 {
-   unsigned int flags = current->flags & PF_MEMALLOC_NOCMA;
+   unsigned int flags = current->flags & PF_MEMALLOC_PIN;
 
-   current->flags |= PF_MEMALLOC_NOCMA;
+   current->flags |= PF_MEMALLOC_PIN;
return flags;
 }
 
-static inline void memalloc_nocma_restore(unsigned int flags)
+static inline void memalloc_pin_restore(unsigned int flags)
 {
-   current->flags = (current->flags & ~PF_MEMALLOC_NOCMA) | flags;
+   current->flags = (current->flags & ~PF_MEMALLOC_PIN) | flags;
 }
-#else
-static inline unsigned int memalloc_nocma_save(void)
-{
-   return 0;
-}
-
-static inline void memalloc_nocma_restore(unsigned int flags)
-{
-}
-#endif
 
 #ifdef CONFIG_MEMCG
 DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
diff --git a/mm/gup.c b/mm/gup.c
index be57836ba90f..9af6faf1b2b3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1662,7 +1662,7 @@ static long __gup_longterm_locked(struct mm_struct *mm,
long rc;
 
if (gup_flags & FOLL_LONGTERM)
-   flags = memalloc_nocma_save();
+   flags = memalloc_pin_save();
 
rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas, NULL,
 gup_flags);
@@ -1671,7 +1671,7 @@ static long __gup_longterm_locked(struct mm_struct *mm,
if (rc > 0)
rc = check_and_migrate_cma_pages(mm, start, rc, pages,
 vmas, gup_flags);
-   memalloc_nocma_restore(flags);
+   memalloc_pin_restore(flags);
}
return rc;
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4bdb58ab14cb..861de87daf07 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1049,10 +1049,10 @@ static void enqueue_huge_page(struct hstate *h, struct 
page *page)
 static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
 {
struct page *page;
-   bool nocma = !!(current->flags & PF_MEMALLOC_NOCMA);
+   bool pin = !!(current->flags & PF_MEMALLOC_PIN);
 
list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
-   if (nocma && is_migrate_cma_page(page))
+   if (pin && is_migrate_cma_page(page))
continue;
 
if (PageHWPoison(page))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 519a60d5b6f7..e4

[PATCH v11 04/14] mm/gup: check for isolation errors

2021-02-15 Thread Pavel Tatashin
It is still possible that we pin movable CMA pages if there are isolation
errors and cma_page_list stays empty when we check again.

Check for isolation errors, and return success only when there are no
isolation errors, and cma_page_list is empty after checking.

Because isolation errors are transient, we retry indefinitely.

Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages 
allocated from CMA region")
Signed-off-by: Pavel Tatashin 
Reviewed-by: Jason Gunthorpe 
---
 mm/gup.c | 60 
 1 file changed, 34 insertions(+), 26 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 2d0292980b1d..be57836ba90f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1548,8 +1548,8 @@ static long check_and_migrate_cma_pages(struct mm_struct 
*mm,
struct vm_area_struct **vmas,
unsigned int gup_flags)
 {
-   unsigned long i;
-   bool drain_allow = true;
+   unsigned long i, isolation_error_count;
+   bool drain_allow;
LIST_HEAD(cma_page_list);
long ret = nr_pages;
struct page *prev_head, *head;
@@ -1560,6 +1560,8 @@ static long check_and_migrate_cma_pages(struct mm_struct 
*mm,
 
 check_again:
prev_head = NULL;
+   isolation_error_count = 0;
+   drain_allow = true;
for (i = 0; i < nr_pages; i++) {
head = compound_head(pages[i]);
if (head == prev_head)
@@ -1571,25 +1573,35 @@ static long check_and_migrate_cma_pages(struct 
mm_struct *mm,
 * of the CMA zone if possible.
 */
if (is_migrate_cma_page(head)) {
-   if (PageHuge(head))
-   isolate_huge_page(head, &cma_page_list);
-   else {
+   if (PageHuge(head)) {
+   if (!isolate_huge_page(head, &cma_page_list))
+   isolation_error_count++;
+   } else {
if (!PageLRU(head) && drain_allow) {
lru_add_drain_all();
drain_allow = false;
}
 
-   if (!isolate_lru_page(head)) {
-   list_add_tail(&head->lru, &cma_page_list);
-   mod_node_page_state(page_pgdat(head),
-   NR_ISOLATED_ANON +
-   page_is_file_lru(head),
-   thp_nr_pages(head));
+   if (isolate_lru_page(head)) {
+   isolation_error_count++;
+   continue;
}
+   list_add_tail(&head->lru, &cma_page_list);
+   mod_node_page_state(page_pgdat(head),
+   NR_ISOLATED_ANON +
+   page_is_file_lru(head),
+   thp_nr_pages(head));
}
}
}
 
+   /*
+* If list is empty, and no isolation errors, means that all pages are
+* in the correct zone.
+*/
+   if (list_empty(&cma_page_list) && !isolation_error_count)
+   return ret;
+
if (!list_empty(&cma_page_list)) {
/*
 * drop the above get_user_pages reference.
@@ -1609,23 +1621,19 @@ static long check_and_migrate_cma_pages(struct 
mm_struct *mm,
return ret > 0 ? -ENOMEM : ret;
}
 
-   /*
-* We did migrate all the pages, Try to get the page references
-* again migrating any new CMA pages which we failed to isolate
-* earlier.
-*/
-   ret = __get_user_pages_locked(mm, start, nr_pages,
-  pages, vmas, NULL,
-  gup_flags);
-
-   if (ret > 0) {
-   nr_pages = ret;
-   drain_allow = true;
-   goto check_again;
-   }
+   /* We unpinned pages before migration, pin them again */
+   ret = __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
+ NULL, gup_flags);
+   if (ret <= 0)
+   return ret;
+   nr_pages = ret;
}
 
-   return ret;
+   /*
+* check again because pages were unpinned, and we also might have
+* had iso

[PATCH v11 03/14] mm/gup: return an error on migration failure

2021-02-15 Thread Pavel Tatashin
When a migration failure occurs, we still pin pages, which means
that we may pin CMA movable pages, which should never be the case.

Instead, return an error without pinning pages when a migration failure
happens.

No need to retry migrating, because migrate_pages() already retries
10 times.

Signed-off-by: Pavel Tatashin 
Reviewed-by: Jason Gunthorpe 
---
 mm/gup.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 11ca49f3f11d..2d0292980b1d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1550,7 +1550,6 @@ static long check_and_migrate_cma_pages(struct mm_struct 
*mm,
 {
unsigned long i;
bool drain_allow = true;
-   bool migrate_allow = true;
LIST_HEAD(cma_page_list);
long ret = nr_pages;
struct page *prev_head, *head;
@@ -1601,17 +1600,15 @@ static long check_and_migrate_cma_pages(struct 
mm_struct *mm,
for (i = 0; i < nr_pages; i++)
put_page(pages[i]);
 
-   if (migrate_pages(&cma_page_list, alloc_migration_target, NULL,
-   (unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
-   /*
-* some of the pages failed migration. Do get_user_pages
-* without migration.
-*/
-   migrate_allow = false;
-
+   ret = migrate_pages(&cma_page_list, alloc_migration_target,
+   NULL, (unsigned long)&mtc, MIGRATE_SYNC,
+   MR_CONTIG_RANGE);
+   if (ret) {
if (!list_empty(&cma_page_list))
putback_movable_pages(&cma_page_list);
+   return ret > 0 ? -ENOMEM : ret;
}
+
/*
 * We did migrate all the pages, Try to get the page references
 * again migrating any new CMA pages which we failed to isolate
@@ -1621,7 +1618,7 @@ static long check_and_migrate_cma_pages(struct mm_struct 
*mm,
   pages, vmas, NULL,
   gup_flags);
 
-   if ((ret > 0) && migrate_allow) {
+   if (ret > 0) {
nr_pages = ret;
drain_allow = true;
goto check_again;
-- 
2.25.1



[PATCH v11 02/14] mm/gup: check every subpage of a compound page during isolation

2021-02-15 Thread Pavel Tatashin
When pages are isolated in check_and_migrate_movable_pages() we skip a
compound number of pages at a time. However, as Jason noted, it is
not necessarily correct that pages[i] corresponds to the pages that
we skipped. This is because it is possible that the addresses in
this range had split_huge_pmd()/split_huge_pud() called on them, and
these functions do not update the compound page metadata.

The problem can be reproduced if something like this occurs:

1. User faulted huge pages.
2. split_huge_pmd() was called for some reason
3. User has unmapped some sub-pages in the range
4. User tries to longterm pin the addresses.

The resulting pages[i] might end up containing pages which are not
aligned to the compound page size.
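
A rough userspace sketch of that sequence (hypothetical; the long-term pin
in step 4 would come from a driver such as vfio, or from the gup_test
ioctl used elsewhere in this series, and is not shown here):

#include <string.h>
#include <sys/mman.h>

#define THP_SIZE  (2UL << 20)
#define PAGE_SZ   4096UL

int main(void)
{
	/* 1. fault in huge pages (alignment not guaranteed; sketch only) */
	char *p = mmap(NULL, THP_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	madvise(p, THP_SIZE, MADV_HUGEPAGE);
	memset(p, 1, THP_SIZE);

	/* 2.+3. partial unmap forces split_huge_pmd() and leaves holes */
	munmap(p + PAGE_SZ, PAGE_SZ);

	/*
	 * 4. a FOLL_LONGTERM pin of the remaining range now produces a
	 * pages[] array where pages[i] can start at a tail page, so
	 * stepping by compound_nr(head) in the isolation loop is wrong.
	 */
	return 0;
}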

Fixes: aa712399c1e8 ("mm/gup: speed up check_and_migrate_cma_pages() on huge 
page")
Reported-by: Jason Gunthorpe 
Signed-off-by: Pavel Tatashin 
Reviewed-by: Jason Gunthorpe 
---
 mm/gup.c | 19 +++
 1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index df92170e3730..11ca49f3f11d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1549,26 +1549,23 @@ static long check_and_migrate_cma_pages(struct 
mm_struct *mm,
unsigned int gup_flags)
 {
unsigned long i;
-   unsigned long step;
bool drain_allow = true;
bool migrate_allow = true;
LIST_HEAD(cma_page_list);
long ret = nr_pages;
+   struct page *prev_head, *head;
struct migration_target_control mtc = {
.nid = NUMA_NO_NODE,
.gfp_mask = GFP_USER | __GFP_NOWARN,
};
 
 check_again:
-   for (i = 0; i < nr_pages;) {
-
-   struct page *head = compound_head(pages[i]);
-
-   /*
-* gup may start from a tail page. Advance step by the left
-* part.
-*/
-   step = compound_nr(head) - (pages[i] - head);
+   prev_head = NULL;
+   for (i = 0; i < nr_pages; i++) {
+   head = compound_head(pages[i]);
+   if (head == prev_head)
+   continue;
+   prev_head = head;
/*
 * If we get a page from the CMA zone, since we are going to
 * be pinning these entries, we might as well move them out
@@ -1592,8 +1589,6 @@ static long check_and_migrate_cma_pages(struct mm_struct 
*mm,
}
}
}
-
-   i += step;
}
 
if (!list_empty(&cma_page_list)) {
-- 
2.25.1



[PATCH v11 00/14] prohibit pinning pages in ZONE_MOVABLE

2021-02-15 Thread Pavel Tatashin
Changelog
-
v11
- Another build fix reported by robot on i386: moved is_pinnable_page()
  below set_page_section() in linux/mm.h

v10
- Fixed !CONFIG_MMU compiler issues by adding is_zero_pfn() stub.

v9
- Renamed gpf_to_alloc_flags() to gfp_to_alloc_flags_cma(); thanks Lecopzer
  Chen for noticing.
- Fixed warning reported scripts/checkpatch.pl:
  "Logical continuations should be on the previous line"

v8
- Added reviewed by's from John Hubbard
- Fixed subjects for selftests patches
- Moved zero page check inside is_pinnable_page() as requested by Jason
  Gunthorpe.

v7
- Added reviewed-by's
- Fixed a compile bug on non-mmu builds reported by robot

v6
  Small update, but I wanted to send it out quicker, as it removes a
  controversial patch and replaces it with something sane.
- Removed forcing FOLL_WRITE for longterm gup, instead added a patch to
  skip zero pages during migration.
- Added reviewed-by's and minor log changes.

v5
- Added the following patches to the beginning of series, which are fixes
   to the other existing problems with CMA migration code:
mm/gup: check every subpage of a compound page during isolation
mm/gup: return an error on migration failure
mm/gup: check for isolation errors also at the beginning of series
mm/gup: do not allow zero page for pinned pages
- remove .gfp_mask/.reclaim_idx changes from mm/vmscan.c
- update movable zone header comment in patch 8 instead of patch 3, fix
  the comment
- Added acked, sign-offs
- Updated commit logs based on feedback
- Addressed issues reported by Michal and Jason.
- Remove:
#define PINNABLE_MIGRATE_MAX	10
#define PINNABLE_ISOLATE_MAX	100
   Instead: fail on the first migration failure, and retry isolation
   forever as their failures are transient.

- In selftests, addressed some of the comments from John Hubbard, updated
  commit logs, and added comments. Renamed gup->flags to gup->test_flags.

v4
- Address page migration comments. New patch:
  mm/gup: limit number of gup migration failures, honor failures
  Implements the limiting number of retries for migration failures, and
  also check for isolation failures.
  Added a test case into gup_test to verify that pages never long-term
  pinned in a movable zone, and also added tests to fault both in kernel
  and in userland.
v3
- Merged with linux-next, which contains clean-up patch from Jason,
  therefore this series is reduced by two patches which did the same
  thing.
v2
- Addressed all review comments
- Added Reviewed-by's.
- Renamed PF_MEMALLOC_NOMOVABLE to PF_MEMALLOC_PIN
- Added is_pinnable_page() to check if page can be longterm pinned
- Fixed gup fast path by checking is_in_pinnable_zone()
- rename cma_page_list to movable_page_list
- add a admin-guide note about handling pinned pages in ZONE_MOVABLE,
  updated caveat about pinned pages from linux/mmzone.h
- Move current_gfp_context() to fast-path

-
When a page is pinned it cannot be moved, and its physical address stays
the same until the page is unpinned.

This is useful functionality that allows userland to implement DMA
access. For example, it is used by vfio in vfio_pin_pages().

However, this functionality breaks memory hotplug/hotremove assumptions
that pages in ZONE_MOVABLE can always be migrated.

This patch series fixes this issue by forcing new allocations during
page pinning to omit ZONE_MOVABLE, and also to migrate any existing
pages from ZONE_MOVABLE during pinning.

It uses the same logic that is currently used by CMA, and extends
the functionality to all allocations.
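
In short, the series is built around two small checks; the fragments below
are a simplified restatement of patches 5-9, not a literal excerpt:

	/* 1. What may be long-term pinned (is_pinnable_page(), patch 07/14): */
	pinnable = !is_zone_movable_page(page) && !is_migrate_cma_page(page);

	/*
	 * 2. What may be allocated while a long-term pin is in progress
	 *    (current_gfp_context() with PF_MEMALLOC_PIN, patches 05-07/14):
	 */
	if (current->flags & PF_MEMALLOC_PIN)
		gfp_mask &= ~__GFP_MOVABLE;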

For more information read the discussion [1] about this problem.
[1] 
https://lore.kernel.org/lkml/ca+ck2bbffhbxjmb9jmskacm0fjminyt3nhk8nx6iudcqsj8...@mail.gmail.com

Previous versions:
v1
https://lore.kernel.org/lkml/20201202052330.474592-1-pasha.tatas...@soleen.com
v2
https://lore.kernel.org/lkml/20201210004335.64634-1-pasha.tatas...@soleen.com
v3
https://lore.kernel.org/lkml/20201211202140.396852-1-pasha.tatas...@soleen.com
v4
https://lore.kernel.org/lkml/20201217185243.3288048-1-pasha.tatas...@soleen.com
v5
https://lore.kernel.org/lkml/20210119043920.155044-1-pasha.tatas...@soleen.com
v6
https://lore.kernel.org/lkml/20210120014333.222547-1-pasha.tatas...@soleen.com
v7
https://lore.kernel.org/lkml/20210122033748.924330-1-pasha.tatas...@soleen.com
v8
https://lore.kernel.org/lkml/20210125194751.1275316-1-pasha.tatas...@soleen.com
v9
https://lore.kernel.org/lkml/20210201153827.444374-1-pasha.tatas...@soleen.com
v10
https://lore.kernel.org/lkml/20210211162427.618913-1-pasha.tatas...@soleen.com

Pavel Tatashin (14):
  mm/gup: don't pin migrated cma pages in movable zone
  mm/gup: check every subpage of a compound page during isolation
  mm/gup: return an error on migration failure
  mm/gup: check for isolation errors
  mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN
  mm: apply per-task gfp constraints in fast path
  mm: honor PF_MEMALLOC_PIN for all mo

[PATCH v11 01/14] mm/gup: don't pin migrated cma pages in movable zone

2021-02-15 Thread Pavel Tatashin
In order not to fragment CMA the pinned pages are migrated. However,
they are migrated to ZONE_MOVABLE, which also should not have pinned pages.

Remove __GFP_MOVABLE, so pages can be migrated to zones where pinning
is allowed.

Signed-off-by: Pavel Tatashin 
Reviewed-by: David Hildenbrand 
Reviewed-by: John Hubbard 
Acked-by: Michal Hocko 
---
 mm/gup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index e4c224cd9661..df92170e3730 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1556,7 +1556,7 @@ static long check_and_migrate_cma_pages(struct mm_struct 
*mm,
long ret = nr_pages;
struct migration_target_control mtc = {
.nid = NUMA_NO_NODE,
-   .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_NOWARN,
+   .gfp_mask = GFP_USER | __GFP_NOWARN,
};
 
 check_again:
-- 
2.25.1



Re: [PATCH] arm64: mm: correct the start of physical address in linear map

2021-02-15 Thread Pavel Tatashin
On Mon, Feb 15, 2021 at 12:26 AM Anshuman Khandual
 wrote:
>
> Hello Pavel,
>
> On 2/13/21 6:53 AM, Pavel Tatashin wrote:
> > Memory hotplug may fail on systems with CONFIG_RANDOMIZE_BASE because the
> > linear map range is not checked correctly.
> >
> > The start physical address that linear map covers can be actually at the
> > end of the range because of randmomization. Check that and if so reduce it
> > to 0.
>
> Looking at the code, this seems possible if memstart_addr which is a signed
> value becomes large (after falling below 0) during arm64_memblock_init().

Right.

>
> >
> > This can be verified on QEMU with setting kaslr-seed to ~0ul:
> >
> > memstart_offset_seed = 0x
> > START: __pa(_PAGE_OFFSET(vabits_actual)) = ffff9000c000
> > END:   __pa(PAGE_END - 1) =  1000bfff
> >
> > Signed-off-by: Pavel Tatashin 
> > Fixes: 58284a901b42 ("arm64/mm: Validate hotplug range before creating 
> > linear mapping")
> > ---
> >  arch/arm64/mm/mmu.c | 15 +--
> >  1 file changed, 13 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index ae0c3d023824..6057ecaea897 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -1444,14 +1444,25 @@ static void __remove_pgd_mapping(pgd_t *pgdir, 
> > unsigned long start, u64 size)
> >
> >  static bool inside_linear_region(u64 start, u64 size)
> >  {
> > + u64 start_linear_pa = __pa(_PAGE_OFFSET(vabits_actual));
> > + u64 end_linear_pa = __pa(PAGE_END - 1);
> > +
> > + /*
> > +  * Check for a wrap, it is possible because of randomized linear 
> > mapping
> > +  * the start physical address is actually bigger than the end physical
> > +  * address. In this case set start to zero because [0, end_linear_pa]
> > +  * range must still be able to cover all addressable physical 
> > addresses.
> > +  */
>
> If this is possible only with randomized linear mapping, could you please
> add IS_ENABLED(CONFIG_RANDOMIZED_BASE) during the switch over. Wondering
> if WARN_ON(start_linear_pa > end_linear_pa) should be added otherwise i.e
> when linear mapping randomization is not enabled.

Yeah, good idea, I will add ifdef for CONFIG_RANDOMIZED_BASE.

>
> > + if (start_linear_pa > end_linear_pa)
> > + start_linear_pa = 0;
>
> This looks okay but will double check and give it some more testing.

Thank you,
Pasha


Re: [PATCH] arm64: mm: correct the start of physical address in linear map

2021-02-13 Thread Pavel Tatashin
> We're ignoring the portion from the linear mapping's start PA to the
> point of wraparound. Could the start and end of the hot plugged memory
> fall within this range and, as a result, the hot plug operation be
> incorrectly blocked?

Hi Tyler,

Thank you for looking at this fix. The maximum addressable PAs can be
seen in this function: id_aa64mmfr0_parange_to_phys_shift(). For
example, for a PA shift of 32, the linear map must be able to cover any
physical address from 0 to "1 << 32". Therefore, the [0, __pa(PAGE_END - 1)]
range must include [0, "1 << 32"].

The randomization of the linear map tries to hide where exactly within
the linear map the [0 to max_phys] addresses are located by changing
PHYS_OFFSET (the linear map space is usually much bigger than the PA
space). Therefore, the beginning or end of the linear map can actually
convert to completely bogus high PA addresses, but this is normal.
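
A small standalone model of that effect, using the numbers from the QEMU
example in this thread (a sketch only; it uses the conceptual __pa()
relation va - PAGE_OFFSET + PHYS_OFFSET rather than the literal arm64
macros):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* 48-bit VA: linear map covers [PAGE_OFFSET, PAGE_END) */
	uint64_t page_offset = 0xffff000000000000ULL;
	uint64_t page_end    = 0xffff800000000000ULL;

	/* randomized memstart_addr (PHYS_OFFSET); negative when seen as s64 */
	int64_t memstart_addr = (int64_t)0xffff9000c0000000ULL;

	/* conceptual __pa() of a linear-map VA: va - PAGE_OFFSET + PHYS_OFFSET */
	uint64_t start_pa = page_offset - page_offset + (uint64_t)memstart_addr;
	uint64_t end_pa   = (page_end - 1) - page_offset + (uint64_t)memstart_addr;

	printf("START: %llx\n", (unsigned long long)start_pa);	/* ffff9000c0000000 */
	printf("END:   %llx\n", (unsigned long long)end_pa);	/* 1000bfffffff */
	return 0;
}

This prints the same START/END pair as the commit message of the fix, i.e.
the "start" PA wraps above the "end" PA.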

Thank you,
Pasha

>
> Tyler
>
> > +
> >   /*
> >* Linear mapping region is the range [PAGE_OFFSET..(PAGE_END - 1)]
> >* accommodating both its ends but excluding PAGE_END. Max physical
> >* range which can be mapped inside this linear mapping range, must
> >* also be derived from its end points.
> >*/
> > - return start >= __pa(_PAGE_OFFSET(vabits_actual)) &&
> > -(start + size - 1) <= __pa(PAGE_END - 1);
> > + return start >= start_linear_pa && (start + size - 1) <= 
> > end_linear_pa;
> >  }
> >
> >  int arch_add_memory(int nid, u64 start, u64 size,
> > --
> > 2.25.1
> >


[PATCH] arm64: mm: correct the start of physical address in linear map

2021-02-12 Thread Pavel Tatashin
Memory hotplug may fail on systems with CONFIG_RANDOMIZE_BASE because the
linear map range is not checked correctly.

The start physical address that the linear map covers can actually be at
the end of the range because of randomization. Check for that, and if so,
reduce it to 0.

This can be verified on QEMU with setting kaslr-seed to ~0ul:

memstart_offset_seed = 0x
START: __pa(_PAGE_OFFSET(vabits_actual)) = 9000c000
END:   __pa(PAGE_END - 1) =  1000bfff

Signed-off-by: Pavel Tatashin 
Fixes: 58284a901b42 ("arm64/mm: Validate hotplug range before creating linear 
mapping")
---
 arch/arm64/mm/mmu.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index ae0c3d023824..6057ecaea897 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1444,14 +1444,25 @@ static void __remove_pgd_mapping(pgd_t *pgdir, unsigned 
long start, u64 size)
 
 static bool inside_linear_region(u64 start, u64 size)
 {
+   u64 start_linear_pa = __pa(_PAGE_OFFSET(vabits_actual));
+   u64 end_linear_pa = __pa(PAGE_END - 1);
+
+   /*
+* Check for a wrap, it is possible because of randomized linear mapping
+* the start physical address is actually bigger than the end physical
+* address. In this case set start to zero because [0, end_linear_pa]
+* range must still be able to cover all addressable physical addresses.
+*/
+   if (start_linear_pa > end_linear_pa)
+   start_linear_pa = 0;
+
/*
 * Linear mapping region is the range [PAGE_OFFSET..(PAGE_END - 1)]
 * accommodating both its ends but excluding PAGE_END. Max physical
 * range which can be mapped inside this linear mapping range, must
 * also be derived from its end points.
 */
-   return start >= __pa(_PAGE_OFFSET(vabits_actual)) &&
-  (start + size - 1) <= __pa(PAGE_END - 1);
+   return start >= start_linear_pa && (start + size - 1) <= end_linear_pa;
 }
 
 int arch_add_memory(int nid, u64 start, u64 size,
-- 
2.25.1



[PATCH v10 14/14] selftests/vm: gup_test: test faulting in kernel, and verify pinnable pages

2021-02-11 Thread Pavel Tatashin
When pages are pinned they can be faulted in from userland and migrated,
or they can be faulted in directly in the kernel without migration.

In either case, the pinned pages must end up being pinnable (not movable).

Add a new test to gup_test, to help verify that the gup/pup
(get_user_pages() / pin_user_pages()) behavior with respect to pinnable
and movable pages is reasonable and correct. Specifically, provide a
way to:

1) Verify that only "pinnable" pages are pinned. This is checked
automatically for you.

2) Verify that gup/pup performance is reasonable. This requires
comparing benchmarks between doing gup/pup on pages that have been
pre-faulted in from user space, vs. doing gup/pup on pages that are not
faulted in until gup/pup time (via FOLL_TOUCH). This decision is
controlled with the new -z command line option.
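
For example, a comparison run could look something like this (a rough
sketch; -a selects PIN_FAST_BENCHMARK and -r sets the number of repeats,
as in the existing tool):

root@virtme:/# gup_test -a -r 10        # pages pre-faulted from user space
root@virtme:/# gup_test -a -r 10 -z     # pages faulted in the kernel via FOLL_TOUCH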

Signed-off-by: Pavel Tatashin 
Reviewed-by: John Hubbard 
---
 mm/gup_test.c |  6 ++
 tools/testing/selftests/vm/gup_test.c | 23 +++
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/mm/gup_test.c b/mm/gup_test.c
index a6ed1c877679..d974dec19e1c 100644
--- a/mm/gup_test.c
+++ b/mm/gup_test.c
@@ -52,6 +52,12 @@ static void verify_dma_pinned(unsigned int cmd, struct page 
**pages,
 
dump_page(page, "gup_test failure");
break;
+   } else if (cmd == PIN_LONGTERM_BENCHMARK &&
+   WARN(!is_pinnable_page(page),
+"pages[%lu] is NOT pinnable but pinned\n",
+i)) {
+   dump_page(page, "gup_test failure");
+   break;
}
}
break;
diff --git a/tools/testing/selftests/vm/gup_test.c 
b/tools/testing/selftests/vm/gup_test.c
index 943cc2608dc2..1e662d59c502 100644
--- a/tools/testing/selftests/vm/gup_test.c
+++ b/tools/testing/selftests/vm/gup_test.c
@@ -13,6 +13,7 @@
 
 /* Just the flags we need, copied from mm.h: */
 #define FOLL_WRITE 0x01/* check pte is writable */
+#define FOLL_TOUCH 0x02/* mark page accessed */
 
 static char *cmd_to_str(unsigned long cmd)
 {
@@ -39,11 +40,11 @@ int main(int argc, char **argv)
unsigned long size = 128 * MB;
int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1;
unsigned long cmd = GUP_FAST_BENCHMARK;
-   int flags = MAP_PRIVATE;
+   int flags = MAP_PRIVATE, touch = 0;
char *file = "/dev/zero";
char *p;
 
-   while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHp")) != -1) {
+   while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHpz")) != -1) {
switch (opt) {
case 'a':
cmd = PIN_FAST_BENCHMARK;
@@ -110,6 +111,10 @@ int main(int argc, char **argv)
case 'H':
flags |= (MAP_HUGETLB | MAP_ANONYMOUS);
break;
+   case 'z':
+   /* fault pages in gup, do not fault in userland */
+   touch = 1;
+   break;
default:
return -1;
}
@@ -167,8 +172,18 @@ int main(int argc, char **argv)
else if (thp == 0)
madvise(p, size, MADV_NOHUGEPAGE);
 
-   for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
-   p[0] = 0;
+   /*
+* FOLL_TOUCH, in gup_test, is used as an either/or case: either
+* fault pages in from the kernel via FOLL_TOUCH, or fault them
+* in here, from user space. This allows comparison of performance
+* between those two cases.
+*/
+   if (touch) {
+   gup.gup_flags |= FOLL_TOUCH;
+   } else {
+   for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
+   p[0] = 0;
+   }
 
/* Only report timing information on the *_BENCHMARK commands: */
if ((cmd == PIN_FAST_BENCHMARK) || (cmd == GUP_FAST_BENCHMARK) ||
-- 
2.25.1



[PATCH v10 13/14] selftests/vm: gup_test: fix test flag

2021-02-11 Thread Pavel Tatashin
In gup_test both gup_flags and test_flags use the same flags field.
This is broken.

Further, in the actual gup_test.c all the passed gup_flags are erased and
unconditionally replaced with FOLL_WRITE.

This means that test_flags are ignored, and code like this always
performs the pin dump test:

155 if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
156 nr = pin_user_pages(addr, nr, gup->flags,
157 pages + i, NULL);
158 else
159 nr = get_user_pages(addr, nr, gup->flags,
160 pages + i, NULL);
161 break;

Add a new test_flags field, to allow raw gup_flags to work.
Add a new subcommand for DUMP_USER_PAGES_TEST to specify that the pin
test should be performed.
Remove the unconditional overwriting of gup_flags via FOLL_WRITE, but
preserve the previous behaviour where FOLL_WRITE was the default flag,
and add a new option "-W" to unset FOLL_WRITE.

Rename flags to gup_flags.

With the fix, dump works like this:

root@virtme:/# gup_test  -c
 page #0, starting from user virt addr: 0x7f8acb9e4000
page:d3d2ee27 refcount:2 mapcount:1 mapping:
index:0x0 pfn:0x100bcf
anon flags: 0x3080016(referenced|uptodate|lru|swapbacked)
raw: 03080016 d0e204021608 d0e208df2e88 8ea04243ec61
raw:   0002 
page dumped because: gup_test: dump_pages() test
DUMP_USER_PAGES_TEST: done

root@virtme:/# gup_test  -c -p
 page #0, starting from user virt addr: 0x7fd19701b000
page:baed3c7d refcount:1025 mapcount:1 mapping:
index:0x0 pfn:0x108008
anon flags: 0x3080014(uptodate|lru|swapbacked)
raw: 03080014 d0e204200188 d0e205e09088 8ea04243ee71
raw:   0401 
page dumped because: gup_test: dump_pages() test
DUMP_USER_PAGES_TEST: done

Refcount shows the difference between the pin and no-pin cases.
Also change the type of nr from int to long, as it counts a number of pages.

Signed-off-by: Pavel Tatashin 
Reviewed-by: John Hubbard 
---
 mm/gup_test.c | 23 ++-
 mm/gup_test.h |  3 ++-
 tools/testing/selftests/vm/gup_test.c | 15 +++
 3 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/mm/gup_test.c b/mm/gup_test.c
index e3cf78e5873e..a6ed1c877679 100644
--- a/mm/gup_test.c
+++ b/mm/gup_test.c
@@ -94,7 +94,7 @@ static int __gup_test_ioctl(unsigned int cmd,
 {
ktime_t start_time, end_time;
unsigned long i, nr_pages, addr, next;
-   int nr;
+   long nr;
struct page **pages;
int ret = 0;
bool needs_mmap_lock =
@@ -126,37 +126,34 @@ static int __gup_test_ioctl(unsigned int cmd,
nr = (next - addr) / PAGE_SIZE;
}
 
-   /* Filter out most gup flags: only allow a tiny subset here: */
-   gup->flags &= FOLL_WRITE;
-
switch (cmd) {
case GUP_FAST_BENCHMARK:
-   nr = get_user_pages_fast(addr, nr, gup->flags,
+   nr = get_user_pages_fast(addr, nr, gup->gup_flags,
 pages + i);
break;
case GUP_BASIC_TEST:
-   nr = get_user_pages(addr, nr, gup->flags, pages + i,
+   nr = get_user_pages(addr, nr, gup->gup_flags, pages + i,
NULL);
break;
case PIN_FAST_BENCHMARK:
-   nr = pin_user_pages_fast(addr, nr, gup->flags,
+   nr = pin_user_pages_fast(addr, nr, gup->gup_flags,
 pages + i);
break;
case PIN_BASIC_TEST:
-   nr = pin_user_pages(addr, nr, gup->flags, pages + i,
+   nr = pin_user_pages(addr, nr, gup->gup_flags, pages + i,
NULL);
break;
case PIN_LONGTERM_BENCHMARK:
nr = pin_user_pages(addr, nr,
-   gup->flags | FOLL_LONGTERM,
+   gup->gup_flags | FOLL_LONGTERM,
pages + i, NULL);
break;
case DUMP_USER_PAGES_TEST:
-   if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
-   nr = pin_user_pages(addr, nr, gup->flags,
+   if (gup->test_flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)

[PATCH v10 12/14] mm/gup: longterm pin migration cleanup

2021-02-11 Thread Pavel Tatashin
When pages are longterm pinned, we must migrate them out of the movable
zone. The function that migrates them has a hidden loop with a goto. The
loop retries on isolation failures, and also after a successful migration.

Make this code better by moving this loop to the caller.
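
Given the return convention documented in the new comment (nr_pages when
all pages are pinnable, zero when pages were migrated or isolation
failed, negative on migration error), the retry loop in the caller ends
up roughly like the sketch below; this is a simplified illustration of
the idea inside __gup_longterm_locked(), not the verbatim hunk, and it
ignores the FOLL_LONGTERM check around memalloc_pin_save()/restore():

	long rc;
	unsigned int flags = memalloc_pin_save();

	do {
		rc = __get_user_pages_locked(mm, start, nr_pages, pages,
					     vmas, NULL, gup_flags);
		if (rc <= 0)
			break;
		/* zero means "pages were migrated or isolation failed, retry" */
		rc = check_and_migrate_movable_pages(rc, pages, gup_flags);
	} while (!rc);

	memalloc_pin_restore(flags);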

Signed-off-by: Pavel Tatashin 
Reviewed-by: Jason Gunthorpe 
---
 mm/gup.c | 93 ++--
 1 file changed, 37 insertions(+), 56 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 9d303c8e907f..e3df4e0813d6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1541,27 +1541,28 @@ struct page *get_dump_page(unsigned long addr)
 #endif /* CONFIG_ELF_CORE */
 
 #ifdef CONFIG_MIGRATION
-static long check_and_migrate_movable_pages(struct mm_struct *mm,
-   unsigned long start,
-   unsigned long nr_pages,
+/*
+ * Check whether all pages are pinnable, if so return number of pages.  If some
+ * pages are not pinnable, migrate them, and unpin all pages. Return zero if
+ * pages were migrated, or if some pages were not successfully isolated.
+ * Return negative error if migration fails.
+ */
+static long check_and_migrate_movable_pages(unsigned long nr_pages,
struct page **pages,
-   struct vm_area_struct **vmas,
unsigned int gup_flags)
 {
-   unsigned long i, isolation_error_count;
-   bool drain_allow;
+   unsigned long i;
+   unsigned long isolation_error_count = 0;
+   bool drain_allow = true;
LIST_HEAD(movable_page_list);
-   long ret = nr_pages;
-   struct page *prev_head, *head;
+   long ret = 0;
+   struct page *prev_head = NULL;
+   struct page *head;
struct migration_target_control mtc = {
.nid = NUMA_NO_NODE,
.gfp_mask = GFP_USER | __GFP_NOWARN,
};
 
-check_again:
-   prev_head = NULL;
-   isolation_error_count = 0;
-   drain_allow = true;
for (i = 0; i < nr_pages; i++) {
head = compound_head(pages[i]);
if (head == prev_head)
@@ -1599,47 +1600,27 @@ static long check_and_migrate_movable_pages(struct 
mm_struct *mm,
 * in the correct zone.
 */
if (list_empty(_page_list) && !isolation_error_count)
-   return ret;
+   return nr_pages;
 
+   if (gup_flags & FOLL_PIN) {
+   unpin_user_pages(pages, nr_pages);
+   } else {
+   for (i = 0; i < nr_pages; i++)
+   put_page(pages[i]);
+   }
if (!list_empty(_page_list)) {
-   /*
-* drop the above get_user_pages reference.
-*/
-   if (gup_flags & FOLL_PIN)
-   unpin_user_pages(pages, nr_pages);
-   else
-   for (i = 0; i < nr_pages; i++)
-   put_page(pages[i]);
-
ret = migrate_pages(_page_list, alloc_migration_target,
NULL, (unsigned long), MIGRATE_SYNC,
MR_LONGTERM_PIN);
-   if (ret) {
-   if (!list_empty(_page_list))
-   putback_movable_pages(_page_list);
-   return ret > 0 ? -ENOMEM : ret;
-   }
-
-   /* We unpinned pages before migration, pin them again */
-   ret = __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
- NULL, gup_flags);
-   if (ret <= 0)
-   return ret;
-   nr_pages = ret;
+   if (ret && !list_empty(_page_list))
+   putback_movable_pages(_page_list);
}
 
-   /*
-* check again because pages were unpinned, and we also might have
-* had isolation errors and need more pages to migrate.
-*/
-   goto check_again;
+   return ret > 0 ? -ENOMEM : ret;
 }
 #else
-static long check_and_migrate_movable_pages(struct mm_struct *mm,
-   unsigned long start,
-   unsigned long nr_pages,
+static long check_and_migrate_movable_pages(unsigned long nr_pages,
struct page **pages,
-   struct vm_area_struct **vmas,
unsigned int gup_flags)
 {
return nr_pages;
@@ -1657,22 +1638,22 @@ static long __gup_longterm_locked(struct mm_struct *mm,
  struct vm_area_struct **vmas,
  unsigned int gup_flags)
 {
-   unsigned long flags = 0;
+   unsigned int flags;
long rc;
 
-   if (gup_fl

[PATCH v10 06/14] mm: apply per-task gfp constraints in fast path

2021-02-11 Thread Pavel Tatashin
Function current_gfp_context() is called after the fast path. However,
soon we will add more constraints which will also limit zones based on
context. Move this call into the fast path, and apply the correct
constraints for all allocations.

Also update .reclaim_idx based on the value returned by
current_gfp_context() because it will soon modify the allowed zones.

Note:
With this patch we will do one extra current->flags load during the fast
path, but we already load current->flags in the fast path:

__alloc_pages_nodemask()
 prepare_alloc_pages()
  current_alloc_flags(gfp_mask, *alloc_flags);

Later, when we add the zone constraint logic to current_gfp_context() we
will be able to remove the current->flags load from current_alloc_flags,
and therefore return the fast path to the current performance level.
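
For reference, a context gets marked for such scoped constraints with the
memalloc_no{fs,io}_save()/restore() helpers. A minimal sketch of a
filesystem path that must not recurse into FS reclaim:

	struct page *page;
	unsigned int nofs_flags = memalloc_nofs_save();

	/*
	 * Any allocation in this scope behaves as if GFP_NOFS was passed:
	 * current_gfp_context() strips __GFP_FS from the gfp mask.
	 */
	page = alloc_page(GFP_KERNEL);
	if (page)
		__free_page(page);

	memalloc_nofs_restore(nofs_flags);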

Suggested-by: Michal Hocko 
Signed-off-by: Pavel Tatashin 
Acked-by: Michal Hocko 
---
 mm/page_alloc.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c843dd64a74a..92f1741285c1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4982,6 +4982,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int 
order, int preferred_nid,
}
 
gfp_mask &= gfp_allowed_mask;
+   /*
+* Apply scoped allocation constraints. This is mainly about GFP_NOFS
+* resp. GFP_NOIO which has to be inherited for all allocation requests
+* from a particular context which has been marked by
+* memalloc_no{fs,io}_{save,restore}.
+*/
+   gfp_mask = current_gfp_context(gfp_mask);
alloc_mask = gfp_mask;
if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, , 
_mask, _flags))
return NULL;
@@ -4997,13 +5004,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int 
order, int preferred_nid,
if (likely(page))
goto out;
 
-   /*
-* Apply scoped allocation constraints. This is mainly about GFP_NOFS
-* resp. GFP_NOIO which has to be inherited for all allocation requests
-* from a particular context which has been marked by
-* memalloc_no{fs,io}_{save,restore}.
-*/
-   alloc_mask = current_gfp_context(gfp_mask);
+   alloc_mask = gfp_mask;
ac.spread_dirty_pages = false;
 
/*
-- 
2.25.1



[PATCH v10 07/14] mm: honor PF_MEMALLOC_PIN for all movable pages

2021-02-11 Thread Pavel Tatashin
PF_MEMALLOC_PIN is only honored for CMA pages. Extend this flag to work
for any allocation from ZONE_MOVABLE by removing __GFP_MOVABLE from
gfp_mask when this flag is present in the current context.

Add is_pinnable_page() to return true if a page is pinnable. A pinnable
page is not in ZONE_MOVABLE and not of MIGRATE_CMA type.

Signed-off-by: Pavel Tatashin 
Acked-by: Michal Hocko 
---
 include/linux/mm.h   | 18 ++
 include/linux/sched/mm.h |  6 +-
 mm/hugetlb.c |  2 +-
 mm/page_alloc.c  | 20 +---
 4 files changed, 33 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 89fca443e6f1..9a31b2298c1d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1122,6 +1122,24 @@ static inline bool is_zone_device_page(const struct page 
*page)
 }
 #endif
 
+static inline bool is_zone_movable_page(const struct page *page)
+{
+   return page_zonenum(page) == ZONE_MOVABLE;
+}
+
+/* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */
+#ifdef CONFIG_MIGRATION
+static inline bool is_pinnable_page(struct page *page)
+{
+   return !is_zone_movable_page(page) && !is_migrate_cma_page(page);
+}
+#else
+static inline bool is_pinnable_page(struct page *page)
+{
+   return true;
+}
+#endif
+
 #ifdef CONFIG_DEV_PAGEMAP_OPS
 void free_devmap_managed_page(struct page *page);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 5f4dd3274734..a55277b0d475 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -150,12 +150,13 @@ static inline bool in_vfork(struct task_struct *tsk)
  * Applies per-task gfp context to the given allocation flags.
  * PF_MEMALLOC_NOIO implies GFP_NOIO
  * PF_MEMALLOC_NOFS implies GFP_NOFS
+ * PF_MEMALLOC_PIN  implies !GFP_MOVABLE
  */
 static inline gfp_t current_gfp_context(gfp_t flags)
 {
unsigned int pflags = READ_ONCE(current->flags);
 
-   if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS))) {
+   if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | 
PF_MEMALLOC_PIN))) {
/*
 * NOIO implies both NOIO and NOFS and it is a weaker context
 * so always make sure it makes precedence
@@ -164,6 +165,9 @@ static inline gfp_t current_gfp_context(gfp_t flags)
flags &= ~(__GFP_IO | __GFP_FS);
else if (pflags & PF_MEMALLOC_NOFS)
flags &= ~__GFP_FS;
+
+   if (pflags & PF_MEMALLOC_PIN)
+   flags &= ~__GFP_MOVABLE;
}
return flags;
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1d909879c1b4..90c4d279dec4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1047,7 +1047,7 @@ static struct page *dequeue_huge_page_node_exact(struct 
hstate *h, int nid)
bool pin = !!(current->flags & PF_MEMALLOC_PIN);
 
list_for_each_entry(page, >hugepage_freelists[nid], lru) {
-   if (pin && is_migrate_cma_page(page))
+   if (pin && !is_pinnable_page(page))
continue;
 
if (PageHWPoison(page))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 92f1741285c1..d21d3c12aa31 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3808,16 +3808,13 @@ alloc_flags_nofragment(struct zone *zone, gfp_t 
gfp_mask)
return alloc_flags;
 }
 
-static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
-   unsigned int alloc_flags)
+/* Must be called after current_gfp_context() which can change gfp_mask */
+static inline unsigned int gfp_to_alloc_flags_cma(gfp_t gfp_mask,
+ unsigned int alloc_flags)
 {
 #ifdef CONFIG_CMA
-   unsigned int pflags = current->flags;
-
-   if (!(pflags & PF_MEMALLOC_PIN) &&
-   gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+   if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;
-
 #endif
return alloc_flags;
 }
@@ -4473,7 +4470,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
} else if (unlikely(rt_task(current)) && !in_interrupt())
alloc_flags |= ALLOC_HARDER;
 
-   alloc_flags = current_alloc_flags(gfp_mask, alloc_flags);
+   alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, alloc_flags);
 
return alloc_flags;
 }
@@ -4775,7 +4772,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
if (reserve_flags)
-   alloc_flags = current_alloc_flags(gfp_mask, reserve_flags);
+   alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, reserve_flags);
 
/*
 * Reset the nodemask and zonelist iterators if memory policies can be
@@ -4944,7 +4941,7 @@ static inline bool prepare_allo

[PATCH v10 09/14] mm/gup: migrate pinned pages out of movable zone

2021-02-11 Thread Pavel Tatashin
We should not pin pages in ZONE_MOVABLE. Currently, only movable CMA
pages are excluded from pinning. Generalize the function that migrates
CMA pages to migrate all movable pages. Use is_pinnable_page() to check
which pages need to be migrated.

Signed-off-by: Pavel Tatashin 
Reviewed-by: John Hubbard 
---
 include/linux/migrate.h|  1 +
 include/linux/mmzone.h |  9 -
 include/trace/events/migrate.h |  3 +-
 mm/gup.c   | 67 +-
 4 files changed, 44 insertions(+), 36 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 3a389633b68f..fdf65f23acec 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -27,6 +27,7 @@ enum migrate_reason {
MR_MEMPOLICY_MBIND,
MR_NUMA_MISPLACED,
MR_CONTIG_RANGE,
+   MR_LONGTERM_PIN,
MR_TYPES
 };
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 66132f8f051e..5e0f79a4092b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -407,8 +407,13 @@ enum zone_type {
 * to increase the number of THP/huge pages. Notable special cases are:
 *
 * 1. Pinned pages: (long-term) pinning of movable pages might
-*essentially turn such pages unmovable. Memory offlining might
-*retry a long time.
+*essentially turn such pages unmovable. Therefore, we do not allow
+*pinning long-term pages in ZONE_MOVABLE. When pages are pinned and
+*faulted, they come from the right zone right away. However, it is
+*still possible that address space already has pages in
+*ZONE_MOVABLE at the time when pages are pinned (i.e. user has
+*touches that memory before pinning). In such case we migrate them
+*to a different zone. When migration fails - pinning fails.
 * 2. memblock allocations: kernelcore/movablecore setups might create
 *situations where ZONE_MOVABLE contains unmovable allocations
 *after boot. Memory offlining and allocations fail early.
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 4d434398d64d..363b54ce104c 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -20,7 +20,8 @@
EM( MR_SYSCALL, "syscall_or_cpuset")\
EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind")  \
EM( MR_NUMA_MISPLACED,  "numa_misplaced")   \
-   EMe(MR_CONTIG_RANGE,"contig_range")
+   EM( MR_CONTIG_RANGE,"contig_range") \
+   EMe(MR_LONGTERM_PIN,"longterm_pin")
 
 /*
  * First define the enums in the above macros to be exported to userspace
diff --git a/mm/gup.c b/mm/gup.c
index 489bc02fc008..ced2303dc59e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -87,11 +87,12 @@ __maybe_unused struct page *try_grab_compound_head(struct 
page *page,
int orig_refs = refs;
 
/*
-* Can't do FOLL_LONGTERM + FOLL_PIN with CMA in the gup fast
-* path, so fail and let the caller fall back to the slow path.
+* Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
+* right zone, so fail and let the caller fall back to the slow
+* path.
 */
-   if (unlikely(flags & FOLL_LONGTERM) &&
-   is_migrate_cma_page(page))
+   if (unlikely((flags & FOLL_LONGTERM) &&
+!is_pinnable_page(page)))
return NULL;
 
/*
@@ -1539,17 +1540,17 @@ struct page *get_dump_page(unsigned long addr)
 }
 #endif /* CONFIG_ELF_CORE */
 
-#ifdef CONFIG_CMA
-static long check_and_migrate_cma_pages(struct mm_struct *mm,
-   unsigned long start,
-   unsigned long nr_pages,
-   struct page **pages,
-   struct vm_area_struct **vmas,
-   unsigned int gup_flags)
+#ifdef CONFIG_MIGRATION
+static long check_and_migrate_movable_pages(struct mm_struct *mm,
+   unsigned long start,
+   unsigned long nr_pages,
+   struct page **pages,
+   struct vm_area_struct **vmas,
+   unsigned int gup_flags)
 {
unsigned long i, isolation_error_count;
bool drain_allow;
-   LIST_HEAD(cma_page_list);
+   LIST_HEAD(movable_page_list);
long ret = nr_pages;
struct page *prev_head, *head;
struct migration_target_control mtc = {
@@ -1567,13 +1568,12 @@ static long

[PATCH v10 08/14] mm/gup: do not migrate zero page

2021-02-11 Thread Pavel Tatashin
On some platforms ZERO_PAGE(0) might end up in a movable zone. Do not
migrate the zero page in gup during longterm pinning, as migration of the
zero page is not allowed.

For example, in x86 QEMU with 16G of memory and kernelcore=5G parameter, I
see the following:

Boot#1: zero_pfn  0x48a8d zero_pfn zone: ZONE_DMA32
Boot#2: zero_pfn 0x20168d zero_pfn zone: ZONE_MOVABLE

On x86, empty_zero_page is declared in .bss and depending on the loader
may end up in different physical locations during boots.

Also, move the is_zero_pfn() and my_zero_pfn() functions under
CONFIG_MMU, because the zero_pfn that they use is declared in memory.c,
which is compiled only with CONFIG_MMU.
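
The zone placement can be observed with a trivial debug snippet along
these lines (illustrative only, using the usual mm helpers):

	unsigned long pfn = page_to_pfn(ZERO_PAGE(0));

	pr_info("zero_pfn %#lx zone: %s\n", pfn,
		page_zonenum(pfn_to_page(pfn)) == ZONE_MOVABLE ?
			"ZONE_MOVABLE" : "other");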

Signed-off-by: Pavel Tatashin 
---
 include/linux/mm.h  |  3 ++-
 include/linux/mmzone.h  |  4 
 include/linux/pgtable.h | 12 
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9a31b2298c1d..9ea4b9305ae5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1131,7 +1131,8 @@ static inline bool is_zone_movable_page(const struct page 
*page)
 #ifdef CONFIG_MIGRATION
 static inline bool is_pinnable_page(struct page *page)
 {
-   return !is_zone_movable_page(page) && !is_migrate_cma_page(page);
+   return !(is_zone_movable_page(page) || is_migrate_cma_page(page)) ||
+   is_zero_pfn(page_to_pfn(page));
 }
 #else
 static inline bool is_pinnable_page(struct page *page)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 47946cec7584..66132f8f051e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -427,6 +427,10 @@ enum zone_type {
 *techniques might use alloc_contig_range() to hide previously
 *exposed pages from the buddy again (e.g., to implement some sort
 *of memory unplug in virtio-mem).
+* 6. ZERO_PAGE(0), kernelcore/movablecore setups might create
+*situations where ZERO_PAGE(0) which is allocated differently
+*on different platforms may end up in a movable zone. ZERO_PAGE(0)
+*cannot be migrated.
 *
 * In general, no unmovable allocations that degrade memory offlining
 * should end up in ZONE_MOVABLE. Allocators (like alloc_contig_range())
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index cdfc4e9f253e..9a218d7eed06 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1107,6 +1107,7 @@ extern void untrack_pfn(struct vm_area_struct *vma, 
unsigned long pfn,
 extern void untrack_pfn_moved(struct vm_area_struct *vma);
 #endif
 
+#ifdef CONFIG_MMU
 #ifdef __HAVE_COLOR_ZERO_PAGE
 static inline int is_zero_pfn(unsigned long pfn)
 {
@@ -1130,6 +1131,17 @@ static inline unsigned long my_zero_pfn(unsigned long 
addr)
return zero_pfn;
 }
 #endif
+#else
+static inline int is_zero_pfn(unsigned long pfn)
+{
+   return 0;
+}
+
+static inline unsigned long my_zero_pfn(unsigned long addr)
+{
+   return 0;
+}
+#endif /* CONFIG_MMU */
 
 #ifdef CONFIG_MMU
 
-- 
2.25.1



[PATCH v10 10/14] memory-hotplug.rst: add a note about ZONE_MOVABLE and page pinning

2021-02-11 Thread Pavel Tatashin
Document the special handling of page pinning when ZONE_MOVABLE is present.

Signed-off-by: Pavel Tatashin 
Suggested-by: David Hildenbrand 
Acked-by: Michal Hocko 
---
 Documentation/admin-guide/mm/memory-hotplug.rst | 9 +
 1 file changed, 9 insertions(+)

diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst 
b/Documentation/admin-guide/mm/memory-hotplug.rst
index 5307f90738aa..05d51d2d8beb 100644
--- a/Documentation/admin-guide/mm/memory-hotplug.rst
+++ b/Documentation/admin-guide/mm/memory-hotplug.rst
@@ -357,6 +357,15 @@ creates ZONE_MOVABLE as following.
Unfortunately, there is no information to show which memory block belongs
to ZONE_MOVABLE. This is TBD.
 
+.. note::
+   Techniques that rely on long-term pinnings of memory (especially, RDMA and
+   vfio) are fundamentally problematic with ZONE_MOVABLE and, therefore, memory
+   hot remove. Pinned pages cannot reside on ZONE_MOVABLE, to guarantee that
+   memory can still get hot removed - be aware that pinning can fail even if
+   there is plenty of free memory in ZONE_MOVABLE. In addition, using
+   ZONE_MOVABLE might make page pinning more expensive, because pages have to 
be
+   migrated off that zone first.
+
 .. _memory_hotplug_how_to_offline_memory:
 
 How to offline memory
-- 
2.25.1



[PATCH v10 05/14] mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN

2021-02-11 Thread Pavel Tatashin
PF_MEMALLOC_NOCMA is used to guarantee that the allocator will not return
pages that might belong to the CMA region. This is currently used for
long term gup to make sure that such pins are not going to be done on any
CMA pages.

When PF_MEMALLOC_NOCMA was introduced we didn't realize that it focuses
on CMA pages too much and that there is a larger class of pages that
needs the same treatment. The MOVABLE zone cannot contain any long term
pins either, so it makes sense to reuse and redefine this flag for that
use case as well. Rename the flag to PF_MEMALLOC_PIN, which defines an
allocation context which can only get pages suitable for long-term pins.

Also rename:
memalloc_nocma_save()/memalloc_nocma_restore()
to
memalloc_pin_save()/memalloc_pin_restore()
and make the new functions common.

Signed-off-by: Pavel Tatashin 
Reviewed-by: John Hubbard 
Acked-by: Michal Hocko 
---
 include/linux/sched.h|  2 +-
 include/linux/sched/mm.h | 21 +
 mm/gup.c |  4 ++--
 mm/hugetlb.c |  4 ++--
 mm/page_alloc.c  |  4 ++--
 5 files changed, 12 insertions(+), 23 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 96837243931a..62c639d67fe3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1576,7 +1576,7 @@ extern struct pid *cad_pid;
 #define PF_SWAPWRITE   0x0080  /* Allowed to write to swap */
 #define PF_NO_SETAFFINITY  0x0400  /* Userland is not allowed to 
meddle with cpus_mask */
 #define PF_MCE_EARLY   0x0800  /* Early kill for mce process 
policy */
-#define PF_MEMALLOC_NOCMA  0x1000  /* All allocation request will 
have _GFP_MOVABLE cleared */
+#define PF_MEMALLOC_PIN0x1000  /* Allocation context 
constrained to zones which allow long term pinning. */
 #define PF_FREEZER_SKIP0x4000  /* Freezer should not 
count it as freezable */
 #define PF_SUSPEND_TASK0x8000  /* This thread called 
freeze_processes() and should not be frozen */
 
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 1ae08b8462a4..5f4dd3274734 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -270,29 +270,18 @@ static inline void memalloc_noreclaim_restore(unsigned 
int flags)
current->flags = (current->flags & ~PF_MEMALLOC) | flags;
 }
 
-#ifdef CONFIG_CMA
-static inline unsigned int memalloc_nocma_save(void)
+static inline unsigned int memalloc_pin_save(void)
 {
-   unsigned int flags = current->flags & PF_MEMALLOC_NOCMA;
+   unsigned int flags = current->flags & PF_MEMALLOC_PIN;
 
-   current->flags |= PF_MEMALLOC_NOCMA;
+   current->flags |= PF_MEMALLOC_PIN;
return flags;
 }
 
-static inline void memalloc_nocma_restore(unsigned int flags)
+static inline void memalloc_pin_restore(unsigned int flags)
 {
-   current->flags = (current->flags & ~PF_MEMALLOC_NOCMA) | flags;
+   current->flags = (current->flags & ~PF_MEMALLOC_PIN) | flags;
 }
-#else
-static inline unsigned int memalloc_nocma_save(void)
-{
-   return 0;
-}
-
-static inline void memalloc_nocma_restore(unsigned int flags)
-{
-}
-#endif
 
 #ifdef CONFIG_MEMCG
 DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
diff --git a/mm/gup.c b/mm/gup.c
index b1f6d56182b3..489bc02fc008 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1661,7 +1661,7 @@ static long __gup_longterm_locked(struct mm_struct *mm,
long rc;
 
if (gup_flags & FOLL_LONGTERM)
-   flags = memalloc_nocma_save();
+   flags = memalloc_pin_save();
 
rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas, NULL,
 gup_flags);
@@ -1670,7 +1670,7 @@ static long __gup_longterm_locked(struct mm_struct *mm,
if (rc > 0)
rc = check_and_migrate_cma_pages(mm, start, rc, pages,
 vmas, gup_flags);
-   memalloc_nocma_restore(flags);
+   memalloc_pin_restore(flags);
}
return rc;
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0b7079dd0d35..1d909879c1b4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1044,10 +1044,10 @@ static void enqueue_huge_page(struct hstate *h, struct 
page *page)
 static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
 {
struct page *page;
-   bool nocma = !!(current->flags & PF_MEMALLOC_NOCMA);
+   bool pin = !!(current->flags & PF_MEMALLOC_PIN);
 
list_for_each_entry(page, >hugepage_freelists[nid], lru) {
-   if (nocma && is_migrate_cma_page(page))
+   if (pin && is_migrate_cma_page(page))
continue;
 
if (PageHWPoison(page))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0b55c9c95364..c8

[PATCH v10 11/14] mm/gup: change index type to long as it counts pages

2021-02-11 Thread Pavel Tatashin
In __get_user_pages_locked(), i counts a number of pages and so should be
long, as long is used in all other places to hold a number of pages, and
32 bits becomes increasingly small for handling values proportional to
page counts.

Signed-off-by: Pavel Tatashin 
Acked-by: Michal Hocko 
---
 mm/gup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index ced2303dc59e..9d303c8e907f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1471,7 +1471,7 @@ static long __get_user_pages_locked(struct mm_struct *mm, 
unsigned long start,
 {
struct vm_area_struct *vma;
unsigned long vm_flags;
-   int i;
+   long i;
 
/* calculate required read or write permissions.
 * If FOLL_FORCE is set, we only require the "MAY" flags.
-- 
2.25.1



[PATCH v10 00/14] prohibit pinning pages in ZONE_MOVABLE

2021-02-11 Thread Pavel Tatashin
Changelog
-
v10
- Fixed !CONFIG_MMU compiler issues by adding is_zero_pfn() stub.

v9
- Renamed gpf_to_alloc_flags() to gfp_to_alloc_flags_cma(); thanks Lecopzer
  Chen for noticing.
- Fixed warning reported scripts/checkpatch.pl:
  "Logical continuations should be on the previous line"

v8
- Added reviewed by's from John Hubbard
- Fixed subjects for selftests patches
- Moved zero page check inside is_pinnable_page() as requested by Jason
  Gunthorpe.

v7
- Added reviewed-by's
- Fixed a compile bug on non-mmu builds reported by robot

v6
  Small update, but I wanted to send it out quicker, as it removes a
  controversial patch and replaces it with something sane.
- Removed forcing FOLL_WRITE for longterm gup, instead added a patch to
  skip zero pages during migration.
- Added reviewed-by's and minor log changes.

v5
- Added the following patches to the beginning of the series, which are
  fixes to other existing problems with the CMA migration code:
mm/gup: check every subpage of a compound page during isolation
mm/gup: return an error on migration failure
mm/gup: check for isolation errors also at the beginning of series
mm/gup: do not allow zero page for pinned pages
- remove .gfp_mask/.reclaim_idx changes from mm/vmscan.c
- update movable zone header comment in patch 8 instead of patch 3, fix
  the comment
- Added acked, sign-offs
- Updated commit logs based on feedback
- Addressed issues reported by Michal and Jason.
- Remove:
#define PINNABLE_MIGRATE_MAX 10
#define PINNABLE_ISOLATE_MAX 100
   Instead: fail on the first migration failure, and retry isolation
   forever as their failures are transient.

- In selftests, addressed some of the comments from John Hubbard, updated
  commit logs, and added comments. Renamed gup->flags to gup->test_flags.

v4
- Address page migration comments. New patch:
  mm/gup: limit number of gup migration failures, honor failures
  Implements the limiting number of retries for migration failures, and
  also check for isolation failures.
  Added a test case into gup_test to verify that pages never long-term
  pinned in a movable zone, and also added tests to fault both in kernel
  and in userland.
v3
- Merged with linux-next, which contains clean-up patch from Jason,
  therefore this series is reduced by two patches which did the same
  thing.
v2
- Addressed all review comments
- Added Reviewed-by's.
- Renamed PF_MEMALLOC_NOMOVABLE to PF_MEMALLOC_PIN
- Added is_pinnable_page() to check if page can be longterm pinned
- Fixed gup fast path by checking is_in_pinnable_zone()
- rename cma_page_list to movable_page_list
- add a admin-guide note about handling pinned pages in ZONE_MOVABLE,
  updated caveat about pinned pages from linux/mmzone.h
- Move current_gfp_context() to fast-path

-
When a page is pinned it cannot be moved and its physical address stays
the same until the page is unpinned.

This is useful functionality that allows userland to implement DMA
access. For example, it is used by vfio in vfio_pin_pages().

However, this functionality breaks memory hotplug/hotremove assumptions
that pages in ZONE_MOVABLE can always be migrated.

This patch series fixes this issue by forcing new allocations during
page pinning to omit ZONE_MOVABLE, and also to migrate any existing
pages from ZONE_MOVABLE during pinning.

It uses the same logic that is currently used by CMA, and extends
the functionality to all allocations.

For more information read the discussion [1] about this problem.
[1] 
https://lore.kernel.org/lkml/ca+ck2bbffhbxjmb9jmskacm0fjminyt3nhk8nx6iudcqsj8...@mail.gmail.com

Previous versions:
v1
https://lore.kernel.org/lkml/20201202052330.474592-1-pasha.tatas...@soleen.com
v2
https://lore.kernel.org/lkml/20201210004335.64634-1-pasha.tatas...@soleen.com
v3
https://lore.kernel.org/lkml/20201211202140.396852-1-pasha.tatas...@soleen.com
v4
https://lore.kernel.org/lkml/20201217185243.3288048-1-pasha.tatas...@soleen.com
v5
https://lore.kernel.org/lkml/20210119043920.155044-1-pasha.tatas...@soleen.com
v6
https://lore.kernel.org/lkml/20210120014333.222547-1-pasha.tatas...@soleen.com
v7
https://lore.kernel.org/lkml/20210122033748.924330-1-pasha.tatas...@soleen.com
v8
https://lore.kernel.org/lkml/20210125194751.1275316-1-pasha.tatas...@soleen.com
v9
https://lore.kernel.org/lkml/20210201153827.444374-1-pasha.tatas...@soleen.com

Pavel Tatashin (14):
  mm/gup: don't pin migrated cma pages in movable zone
  mm/gup: check every subpage of a compound page during isolation
  mm/gup: return an error on migration failure
  mm/gup: check for isolation errors
  mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN
  mm: apply per-task gfp constraints in fast path
  mm: honor PF_MEMALLOC_PIN for all movable pages
  mm/gup: do not migrate zero page
  mm/gup: migrate pinned pages out of movable zone
  memory-hotplug.rst: add a note about ZONE_MOVABLE and page pinning
  mm/gup: change index type to long as it counts pages
  mm/gup: longterm pin migration cleanup
  selftests/vm: gup_test: fix test flag
  selftests/vm: gup_test: test faulting in kernel, and verify pinnable pages

[PATCH v10 02/14] mm/gup: check every subpage of a compound page during isolation

2021-02-11 Thread Pavel Tatashin
When pages are isolated in check_and_migrate_movable_pages() we skip a
compound number of pages at a time. However, as Jason noted, it is not
necessarily correct that pages[i] corresponds to the pages that we
skipped. This is because it is possible that the addresses in this range
had split_huge_pmd()/split_huge_pud() called on them, and these functions
do not update the compound page metadata.

The problem can be reproduced if something like this occurs:

1. User faulted huge pages.
2. split_huge_pmd() was called for some reason
3. User has unmapped some sub-pages in the range
4. User tries to longterm pin the addresses.

The resulting pages[i] might end up having pages which are not aligned to
the compound page size.
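
A simplified, hypothetical illustration of how the old stepping goes
wrong:

	pages[0]   sub-page of a formerly 2MB THP; the stale compound
	           metadata still says "512 pages", so the old code sets
	           step = 512 and jumps ahead
	pages[1..] unrelated order-0 pages faulted in after the split and
	           partial unmap; the old loop skips them without ever
	           checking whether they need to be isolated and migrated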

Fixes: aa712399c1e8 ("mm/gup: speed up check_and_migrate_cma_pages() on huge 
page")
Reported-by: Jason Gunthorpe 
Signed-off-by: Pavel Tatashin 
Reviewed-by: Jason Gunthorpe 
---
 mm/gup.c | 19 +++
 1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 88441de65e34..1f73cbf7fb37 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1548,26 +1548,23 @@ static long check_and_migrate_cma_pages(struct 
mm_struct *mm,
unsigned int gup_flags)
 {
unsigned long i;
-   unsigned long step;
bool drain_allow = true;
bool migrate_allow = true;
LIST_HEAD(cma_page_list);
long ret = nr_pages;
+   struct page *prev_head, *head;
struct migration_target_control mtc = {
.nid = NUMA_NO_NODE,
.gfp_mask = GFP_USER | __GFP_NOWARN,
};
 
 check_again:
-   for (i = 0; i < nr_pages;) {
-
-   struct page *head = compound_head(pages[i]);
-
-   /*
-* gup may start from a tail page. Advance step by the left
-* part.
-*/
-   step = compound_nr(head) - (pages[i] - head);
+   prev_head = NULL;
+   for (i = 0; i < nr_pages; i++) {
+   head = compound_head(pages[i]);
+   if (head == prev_head)
+   continue;
+   prev_head = head;
/*
 * If we get a page from the CMA zone, since we are going to
 * be pinning these entries, we might as well move them out
@@ -1591,8 +1588,6 @@ static long check_and_migrate_cma_pages(struct mm_struct 
*mm,
}
}
}
-
-   i += step;
}
 
if (!list_empty(_page_list)) {
-- 
2.25.1



[PATCH v10 04/14] mm/gup: check for isolation errors

2021-02-11 Thread Pavel Tatashin
It is still possible that we pin movable CMA pages if there are isolation
errors and cma_page_list stays empty when we check again.

Check for isolation errors, and return success only when there are no
isolation errors, and cma_page_list is empty after checking.

Because isolation errors are transient, we retry indefinitely.

Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages 
allocated from CMA region")
Signed-off-by: Pavel Tatashin 
Reviewed-by: Jason Gunthorpe 
---
 mm/gup.c | 60 
 1 file changed, 34 insertions(+), 26 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index eb8c39953d53..b1f6d56182b3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1547,8 +1547,8 @@ static long check_and_migrate_cma_pages(struct mm_struct 
*mm,
struct vm_area_struct **vmas,
unsigned int gup_flags)
 {
-   unsigned long i;
-   bool drain_allow = true;
+   unsigned long i, isolation_error_count;
+   bool drain_allow;
LIST_HEAD(cma_page_list);
long ret = nr_pages;
struct page *prev_head, *head;
@@ -1559,6 +1559,8 @@ static long check_and_migrate_cma_pages(struct mm_struct 
*mm,
 
 check_again:
prev_head = NULL;
+   isolation_error_count = 0;
+   drain_allow = true;
for (i = 0; i < nr_pages; i++) {
head = compound_head(pages[i]);
if (head == prev_head)
@@ -1570,25 +1572,35 @@ static long check_and_migrate_cma_pages(struct 
mm_struct *mm,
 * of the CMA zone if possible.
 */
if (is_migrate_cma_page(head)) {
-   if (PageHuge(head))
-   isolate_huge_page(head, _page_list);
-   else {
+   if (PageHuge(head)) {
+   if (!isolate_huge_page(head, _page_list))
+   isolation_error_count++;
+   } else {
if (!PageLRU(head) && drain_allow) {
lru_add_drain_all();
drain_allow = false;
}
 
-   if (!isolate_lru_page(head)) {
-   list_add_tail(>lru, 
_page_list);
-   mod_node_page_state(page_pgdat(head),
-   NR_ISOLATED_ANON +
-   
page_is_file_lru(head),
-   thp_nr_pages(head));
+   if (isolate_lru_page(head)) {
+   isolation_error_count++;
+   continue;
}
+   list_add_tail(>lru, _page_list);
+   mod_node_page_state(page_pgdat(head),
+   NR_ISOLATED_ANON +
+   page_is_file_lru(head),
+   thp_nr_pages(head));
}
}
}
 
+   /*
+* If list is empty, and no isolation errors, means that all pages are
+* in the correct zone.
+*/
+   if (list_empty(_page_list) && !isolation_error_count)
+   return ret;
+
if (!list_empty(_page_list)) {
/*
 * drop the above get_user_pages reference.
@@ -1608,23 +1620,19 @@ static long check_and_migrate_cma_pages(struct 
mm_struct *mm,
return ret > 0 ? -ENOMEM : ret;
}
 
-   /*
-* We did migrate all the pages, Try to get the page references
-* again migrating any new CMA pages which we failed to isolate
-* earlier.
-*/
-   ret = __get_user_pages_locked(mm, start, nr_pages,
-  pages, vmas, NULL,
-  gup_flags);
-
-   if (ret > 0) {
-   nr_pages = ret;
-   drain_allow = true;
-   goto check_again;
-   }
+   /* We unpinned pages before migration, pin them again */
+   ret = __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
+ NULL, gup_flags);
+   if (ret <= 0)
+   return ret;
+   nr_pages = ret;
}
 
-   return ret;
+   /*
+* check again because pages were unpinned, and we also might have
+* had iso

[PATCH v10 01/14] mm/gup: don't pin migrated cma pages in movable zone

2021-02-11 Thread Pavel Tatashin
In order not to fragment CMA, the pinned pages are migrated. However,
they are migrated to ZONE_MOVABLE, which also should not have pinned pages.

Remove __GFP_MOVABLE, so pages can be migrated to zones where pinning
is allowed.

Signed-off-by: Pavel Tatashin 
Reviewed-by: David Hildenbrand 
Reviewed-by: John Hubbard 
Acked-by: Michal Hocko 
---
 mm/gup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index e40579624f10..88441de65e34 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1555,7 +1555,7 @@ static long check_and_migrate_cma_pages(struct mm_struct 
*mm,
long ret = nr_pages;
struct migration_target_control mtc = {
.nid = NUMA_NO_NODE,
-   .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_NOWARN,
+   .gfp_mask = GFP_USER | __GFP_NOWARN,
};
 
 check_again:
-- 
2.25.1



[PATCH v10 03/14] mm/gup: return an error on migration failure

2021-02-11 Thread Pavel Tatashin
When a migration failure occurs, we still pin pages, which means that we
may pin CMA movable pages, which should never be the case.

Instead, return an error without pinning pages when a migration failure
happens.

There is no need to retry migrating, because migrate_pages() already
retries 10 times.

Signed-off-by: Pavel Tatashin 
Reviewed-by: Jason Gunthorpe 
---
 mm/gup.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 1f73cbf7fb37..eb8c39953d53 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1549,7 +1549,6 @@ static long check_and_migrate_cma_pages(struct mm_struct 
*mm,
 {
unsigned long i;
bool drain_allow = true;
-   bool migrate_allow = true;
LIST_HEAD(cma_page_list);
long ret = nr_pages;
struct page *prev_head, *head;
@@ -1600,17 +1599,15 @@ static long check_and_migrate_cma_pages(struct 
mm_struct *mm,
for (i = 0; i < nr_pages; i++)
put_page(pages[i]);
 
-   if (migrate_pages(_page_list, alloc_migration_target, NULL,
-   (unsigned long), MIGRATE_SYNC, MR_CONTIG_RANGE)) {
-   /*
-* some of the pages failed migration. Do get_user_pages
-* without migration.
-*/
-   migrate_allow = false;
-
+   ret = migrate_pages(_page_list, alloc_migration_target,
+   NULL, (unsigned long), MIGRATE_SYNC,
+   MR_CONTIG_RANGE);
+   if (ret) {
if (!list_empty(_page_list))
putback_movable_pages(_page_list);
+   return ret > 0 ? -ENOMEM : ret;
}
+
/*
 * We did migrate all the pages, Try to get the page references
 * again migrating any new CMA pages which we failed to isolate
@@ -1620,7 +1617,7 @@ static long check_and_migrate_cma_pages(struct mm_struct 
*mm,
   pages, vmas, NULL,
   gup_flags);
 
-   if ((ret > 0) && migrate_allow) {
+   if (ret > 0) {
nr_pages = ret;
drain_allow = true;
goto check_again;
-- 
2.25.1



improving crash dump discussion

2021-02-10 Thread Pavel Tatashin
I would like to start a discussion about how we can improve the Linux
crash dump facility, and use warm reboot / firmware assistance in order
to collect crash dumps more reliably, while using fewer memory resources
and being more performant.

Currently, the main way to collect crash dumps on Linux is kdump. Kdump
makes use of kexec, which is mature and portable (does not depend on
firmware), but using kexec is not ideal.

I will list some problems with kexec/kdump, and then discuss how some
of them (hopefully most) can be addressed.

1. Expecting a crashing kernel to do the right thing: properly quiesce
devices, CPUs and prepare the machine for the new kernel.

The amount of code that is executed to perform a crash kexec reboot is
not trivial. Unfortunately, since we are panicking we have already lost
control at some point, and the goal would be to reduce the amount of code
executed by the panic handler in order to be able to reliably collect
dumps. There are some ways to improve the reliability of a crash kexec
reboot. For example, passing the maxcpus=1 kernel parameter is now
required on almost all platforms, which, unfortunately, has the downside
of forcing the crash kernel to use only a single thread to save the core,
and thus "makedumpfile --num-thread" is useless if used from the crash
kernel.

2. Unlike booting from firmware, the PCI, CPUs, interrupt controllers,
DMAs mappings, and I/O devices are not reinitialized and might not be
in a consistent state.

The reset_devices, irqpoll, and other kernel parameters are also intended
to mitigate these shortfalls by requiring drivers to do the resetting
themselves. Also, the kernel is usually smart enough to ignore
spurious interrupts, but this is fragile.
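
For context, a typical kdump setup today looks something like the sketch
below (illustrative only; exact sizes, flags and paths vary by distro and
platform):

	# first (production) kernel command line: reserve memory for the
	# crash kernel
	crashkernel=512M

	# parameters commonly appended to the crash kernel's command line
	maxcpus=1 reset_devices irqpoll

	# inside the crash kernel, save the dump
	makedumpfile -l -d 31 /proc/vmcore /var/crash/vmcore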

3. There is a blackout window during boot where collecting a crash dump
is not possible.

With the current kdump it is possible to collect crashes only after the
kernel's early boot is finished. During early boot we do a lot: determine
the platform, initialize mm, initialize the clock and the scheduler, and
start other CPUs. Only after entering usermode are we able to kexec-load
the crash kernel into memory, after which a crash can be collected.

4. Kdump is not compatible with hardware watchdog resets

When a hardware watchdog causes a reset, software is not involved, and
therefore we lose the entire machine state.

5. Crash kernel requires memory reservation

The crash kernel can't use the memory that was used by the crashing
kernel. Therefore, memory must always be reserved; it is wasted during
normal operation, and only contains the image of the crash kernel.

6. Crash kernel requires special image and two reboots

A special crash image is usually required to reduce the number of loaded
modules, and also to reduce the system to the bare minimum so that it
can be booted in the small reserved space. Also, after the crash
kernel collects the core dump, we reboot back to the normal kernel,
so two reboots are needed in order to recover after the crash.

==

On the other hand, powerpc can optionally use firmware assisted
kdump (fadump). The benefits of fadump:

1. reboot through firmware happens, and thus all devices are reset to
their initial state

2. memory for the crash kernel does not need to be reserved if CMA is
used and user pages do not need to be preserved (commonly there is no
need to preserve user pages to debug kernel panics).

3. The fadump crash format is identical to kdump (ELF /proc/vmcore),
therefore the tools are the same, i.e. crash(8), makedumpfile, and others
can all be used.

4. No need to have a special crash kernel image and no need to do a
second reboot from the crash kernel.

The following services are expected from firmware in order for fadump to work:

1. Ability to do warm reboot

Preserve memory content across reboot. Firmware must not zero
(initialize) memory content. From my experience, this is actually common
nowadays: I see this happen on my AMD desktop with an x570 chip + UEFI
BIOS; we do this at Microsoft both on larger Xeon servers with UEFI
firmware and on small arm64 devices which use device trees instead of EFI
for performance reasons, and also to preserve emulated pmem devices
across reboot. We also did it at Oracle on SPARC sun4v machines, where
the sun4v hypervisor would not reset memory content on every reboot for
performance reasons.

2. Ability to register preserved memory region with firmware

The first kernel uses firmware to reserve a region of memory that must
be preserved when rebooted. Firmware and bootloader must not allocate
from preserved regions.

3. Ability to copy boot memory source to destination.

On powerpc, boot must start from a lower address, similar to x86.
Also, boot memory is a region of memory that can be used by the kernel
to boot, and the rest is added later once the kernel decides to
unreserve it, i.e. after the vmcore is saved.

The copy boot memory is not strictly necessary: the 
