Re: [PATCH] powerpc/mm: Fix lockup on kernel exec fault

2021-07-01 Thread Christophe Leroy




On 02/07/2021 at 03:25, Nicholas Piggin wrote:

Excerpts from Christophe Leroy's message of July 1, 2021 9:17 pm:

The powerpc kernel is not prepared to handle exec faults from kernel.
Especially, the function is_exec_fault() will return 'false' when an
exec fault is taken by kernel, because the check is based on reading
current->thread.regs->trap which contains the trap from user.

For instance, when provoking a LKDTM EXEC_USERSPACE test,
current->thread.regs->trap is set to SYSCALL trap (0xc00), and
the fault taken by the kernel is not seen as an exec fault by
set_access_flags_filter().

Commit d7df2443cd5f ("powerpc/mm: Fix spurrious segfaults on radix
with autonuma") made it clear and handled it properly. But later on
commit d3ca587404b3 ("powerpc/mm: Fix reporting of kernel execute
faults") removed that handling, introducing test based on error_code.
And here is the problem, because on the 603 all upper bits of SRR1
get cleared when the TLB instruction miss handler bails out to ISI.


So the problem is 603 doesn't see the DSISR_NOEXEC_OR_G bit?


In a way, yes. But the problem is also that the kernel doesn't see it as an exec fault in 
set_access_flags_filter(), as explained above. If it could see it as an exec fault, it would set 
PAGE_EXEC and it would work (or maybe not, because it seems it also checks the dirtiness of the 
page, and here the page is also flagged as dirty).
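
For reference, a minimal sketch of the check being discussed (paraphrased from 
arch/powerpc/mm/pgtable.c; treat the exact body as an approximation):

	static inline int is_exec_fault(void)
	{
		/* current->thread.regs holds the *user* registers, so a
		 * fault taken from kernel context never matches the 0x400
		 * (ISI) trap value and is never seen as an exec fault. */
		return current->thread.regs && TRAP(current->thread.regs) == 0x400;
	}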


603 will see DSISR_NOEXEC_OR_G if it's an access to a page which is in a 
segment flagged NX.



I don't see the problem with this for 64s, I don't think anything sane
can be done for any 0x400 interrupt in the kernel so it's probably
good to catch all here just in case. For 64s,

Acked-by: Nicholas Piggin 

Why is 32s clearing those top bits? And it seems to be setting DSISR
that AFAIKS it does not use. Seems like it would be good to add a
NOEXEC_OR_G bit into SRR1.


Probably for simplicity.

When taking the Instruction TLB Miss interrupt, SRR1 contains CR0 fields in bits 0-3 and some 
dedicated info in bits 12-15. That doesn't match the SRR1 bits for ISI, so before falling back to the 
ISI handler, the ITLB Miss handler error path clears the upper SRR1 bits.


Maybe it could instead try to set the right bits, but that would make it more complicated because the 
error path can be taken for the following reasons:

- No page table
- Not PAGE_PRESENT
- Not PAGE_ACCESSED
- Not PAGE_EXEC
- Below TASK_SIZE and not PAGE_USER

At the moment the verification of the flags is done with a single 'andc' operation (see the sketch 
below). If we wanted to set the proper bits, it would mean testing the flags separately, which would 
impact performance on the no-error path.
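
As an illustration, the single-test approach looks roughly like this (a simplified C sketch, not 
the actual 603 assembly; the flag names are the usual book3s/32 PTE bits):

	/* All flags that must be set for the access to be allowed. */
	unsigned long required = _PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_EXEC;

	/* One 'andc'-style operation: nonzero iff at least one required
	 * flag is missing. It cannot tell *which* flag failed, so no
	 * precise SRR1/DSISR reason bit can be derived without extra
	 * per-flag tests on the hot path. */
	if (required & ~pte_val(pte))
		goto bail_to_isi;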


Or maybe it would be good enough to set the PROTFAULT bit in all cases except the lack of a page table. 
The 8xx sets PROTFAULT when hitting non-exec pages, so the kernel is prepared for it anyway. Not sure 
about the lack of PAGE_PRESENT though; the 8xx sets the NOHPTE bit when PAGE_PRESENT is cleared.


But is it really worth doing?

Christophe


Re: [PATCH v15 12/12] of: Add plumbing for restricted DMA pool

2021-07-01 Thread Guenter Roeck
Hi,

On Thu, Jun 24, 2021 at 11:55:26PM +0800, Claire Chang wrote:
> If a device is not behind an IOMMU, we look up the device node and set
> up the restricted DMA when the restricted-dma-pool is presented.
> 
> Signed-off-by: Claire Chang 
> Tested-by: Stefano Stabellini 
> Tested-by: Will Deacon 

With this patch in place, all sparc and sparc64 qemu emulations
fail to boot. Symptom is that the root file system is not found.
Reverting this patch fixes the problem. Bisect log is attached.

Guenter

---
# bad: [fb0ca446157a86b75502c1636b0d81e642fe6bf1] Add linux-next specific files 
for 20210701
# good: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
git bisect start 'HEAD' 'v5.13'
# bad: [f63c4fda987a19b1194cc45cb72fd5bf968d9d90] Merge remote-tracking branch 
'rdma/for-next'
git bisect bad f63c4fda987a19b1194cc45cb72fd5bf968d9d90
# good: [46bb5dd1d2a63e906e374e97dfd4a5e33934b1c4] Merge remote-tracking branch 
'ipsec/master'
git bisect good 46bb5dd1d2a63e906e374e97dfd4a5e33934b1c4
# good: [43ba6969cfb8185353a7a6fc79070f13b9e3d6d3] Merge remote-tracking branch 
'clk/clk-next'
git bisect good 43ba6969cfb8185353a7a6fc79070f13b9e3d6d3
# good: [1ca5eddcf8dca1d6345471c6404e7364af0d7019] Merge remote-tracking branch 
'fuse/for-next'
git bisect good 1ca5eddcf8dca1d6345471c6404e7364af0d7019
# good: [8f6d7b3248705920187263a4e7147b0752ec7dcf] Merge remote-tracking branch 
'pci/next'
git bisect good 8f6d7b3248705920187263a4e7147b0752ec7dcf
# good: [df1885a755784da3ef285f36d9230c1d090ef186] RDMA/rtrs_clt: Alloc less 
memory with write path fast memory registration
git bisect good df1885a755784da3ef285f36d9230c1d090ef186
# good: [93d31efb58c8ad4a66bbedbc2d082df458c04e45] Merge remote-tracking branch 
'cpufreq-arm/cpufreq/arm/linux-next'
git bisect good 93d31efb58c8ad4a66bbedbc2d082df458c04e45
# good: [46308965ae6fdc7c25deb2e8c048510ae51bbe66] RDMA/irdma: Check contents 
of user-space irdma_mem_reg_req object
git bisect good 46308965ae6fdc7c25deb2e8c048510ae51bbe66
# good: [6de7a1d006ea9db235492b288312838d6878385f] 
thermal/drivers/int340x/processor_thermal: Split enumeration and processing part
git bisect good 6de7a1d006ea9db235492b288312838d6878385f
# good: [081bec2577cda3d04f6559c60b6f4e2242853520] dt-bindings: of: Add 
restricted DMA pool
git bisect good 081bec2577cda3d04f6559c60b6f4e2242853520
# good: [bf95ac0bcd69979af146852f6a617a60285ebbc1] Merge remote-tracking branch 
'thermal/thermal/linux-next'
git bisect good bf95ac0bcd69979af146852f6a617a60285ebbc1
# good: [3d8287544223a3d2f37981c1f9ffd94d0b5e9ffc] RDMA/core: Always release 
restrack object
git bisect good 3d8287544223a3d2f37981c1f9ffd94d0b5e9ffc
# bad: [cff1f23fad6e0bd7d671acce0d15285c709f259c] Merge remote-tracking branch 
'swiotlb/linux-next'
git bisect bad cff1f23fad6e0bd7d671acce0d15285c709f259c
# bad: [b655006619b7bccd0dc1e055bd72de5d613e7b5c] of: Add plumbing for 
restricted DMA pool
git bisect bad b655006619b7bccd0dc1e055bd72de5d613e7b5c
# first bad commit: [b655006619b7bccd0dc1e055bd72de5d613e7b5c] of: Add plumbing 
for restricted DMA pool


[PATCH v2] Documentation: PCI: pci-error-recovery: swap sequence between MMIO Enabled and Link Reset

2021-07-01 Thread Wesley Sheng
The reset_link() callback (named reset_subordinates() in the pcie_do_recovery()
function) is called before mmio_enabled(), so swap the sequence of step 2
(MMIO Enabled) and step 3 (Link Reset) accordingly.

Signed-off-by: Wesley Sheng 
---
 Documentation/PCI/pci-error-recovery.rst | 25 
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/Documentation/PCI/pci-error-recovery.rst 
b/Documentation/PCI/pci-error-recovery.rst
index 187f43a03200..0e2f3f77bf0a 100644
--- a/Documentation/PCI/pci-error-recovery.rst
+++ b/Documentation/PCI/pci-error-recovery.rst
@@ -157,7 +157,7 @@ drivers.
 If all drivers on the segment/slot return PCI_ERS_RESULT_CAN_RECOVER,
 then the platform should re-enable IOs on the slot (or do nothing in
 particular, if the platform doesn't isolate slots), and recovery
-proceeds to STEP 2 (MMIO Enable).
+proceeds to STEP 3 (MMIO Enable).
 
 If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET),
 then recovery proceeds to STEP 4 (Slot Reset).
@@ -184,7 +184,14 @@ is STEP 6 (Permanent Failure).
and prints an error to syslog.  A reboot is then required to
get the device working again.
 
-STEP 2: MMIO Enabled
+STEP 2: Link Reset
+--
+The platform resets the link.  This is a PCI-Express specific step
+and is done whenever a fatal error has been detected that can be
+"solved" by resetting the link.
+
+
+STEP 3: MMIO Enabled
 
 The platform re-enables MMIO to the device (but typically not the
 DMA), and then calls the mmio_enabled() callback on all affected
@@ -197,8 +204,8 @@ information, if any, and eventually do things like trigger 
a device local
 reset or some such, but not restart operations. This callback is made if
 all drivers on a segment agree that they can try to recover and if no automatic
 link reset was performed by the HW. If the platform can't just re-enable IOs
-without a slot reset or a link reset, it will not call this callback, and
-instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
+without a slot reset, it will not call this callback, and
+instead will have gone directly to STEP 4 (Slot Reset)
 
 .. note::
 
@@ -210,7 +217,7 @@ instead will have gone directly to STEP 3 (Link Reset) or 
STEP 4 (Slot Reset)
such an error might cause IOs to be re-blocked for the whole
segment, and thus invalidate the recovery that other devices
on the same segment might have done, forcing the whole segment
-   into one of the next states, that is, link reset or slot reset.
+   into the next states, that is, slot reset.
 
 The driver should return one of the following result codes:
   - PCI_ERS_RESULT_RECOVERED
@@ -233,17 +240,11 @@ The driver should return one of the following result 
codes:
 
 The next step taken depends on the results returned by the drivers.
 If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
-proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
+proceeds to STEP 5 (Resume Operations).
 
 If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
 proceeds to STEP 4 (Slot Reset)
 
-STEP 3: Link Reset
---
-The platform resets the link.  This is a PCI-Express specific step
-and is done whenever a fatal error has been detected that can be
-"solved" by resetting the link.
-
 STEP 4: Slot Reset
 --
 
-- 
2.25.1
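
For context, the ordering inside pcie_do_recovery() that motivates this change looks roughly like 
the following (a heavily simplified sketch based on drivers/pci/pcie/err.c around v5.13, not 
verbatim):

	if (state == pci_channel_io_frozen) {
		/* Fatal error: the link is reset first ... */
		status = reset_subordinates(bridge);
		if (status != PCI_ERS_RESULT_RECOVERED)
			goto failed;
	}

	if (status == PCI_ERS_RESULT_CAN_RECOVER) {
		/* ... and mmio_enabled() is broadcast only afterwards. */
		pci_walk_bridge(bridge, report_mmio_enabled, &status);
	}

So in the code the link reset happens before the MMIO-enabled broadcast, which is what the 
documentation reordering above reflects.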



Re: [PATCH] Documentation: PCI: pci-error-recovery: rearrange the general sequence

2021-07-01 Thread Wesley Sheng
On Thu, Jul 01, 2021 at 05:22:31PM -0500, Bjorn Helgaas wrote:
> Please make the subject a little more specific.  "rearrange the
> general sequence" doesn't say anything about what was affected.
> 
> On Fri, Jun 18, 2021 at 02:04:46PM +0800, Wesley Sheng wrote:
> > Reset_link() callback function was called before mmio_enabled() in
> > pcie_do_recovery() function actually, so rearrange the general
> > sequence betwen step 2 and step 3 accordingly.
> 
> s/betwen/between/
> 
> Not sure "general" adds anything in this sentence.  "Step 2 and step
> 3" are not meaningful here in the commit log.  It needs to spell out
> what those steps are so the log makes sense by itself.
> 
> "reset_link" does not appear in pcie_do_recovery().  I'm guessing
> you're referring to the "reset_subordinates" function pointer?
>
Yes, you are right.
pcieaer-howto.rst has a section named "Provide callbacks"; there, the callback
supplied to pcie_do_recovery() is referred to as reset_link.
 
> > Signed-off-by: Wesley Sheng 
> 
> I didn't quite understand your response to Oliver, so I'll wait for
> your corrections and his ack before proceeding.
>
OK.
I thought step 2 (MMIO Enabled) and step 3 (Link Reset) should swap places.

> > ---
> >  Documentation/PCI/pci-error-recovery.rst | 23 ---
> >  1 file changed, 12 insertions(+), 11 deletions(-)
> > 
> > diff --git a/Documentation/PCI/pci-error-recovery.rst 
> > b/Documentation/PCI/pci-error-recovery.rst
> > index 187f43a03200..ac6a8729ef28 100644
> > --- a/Documentation/PCI/pci-error-recovery.rst
> > +++ b/Documentation/PCI/pci-error-recovery.rst
> > @@ -184,7 +184,14 @@ is STEP 6 (Permanent Failure).
> > and prints an error to syslog.  A reboot is then required to
> > get the device working again.
> >  
> > -STEP 2: MMIO Enabled
> > +STEP 2: Link Reset
> > +--
> > +The platform resets the link.  This is a PCI-Express specific step
> > +and is done whenever a fatal error has been detected that can be
> > +"solved" by resetting the link.
> > +
> > +
> > +STEP 3: MMIO Enabled
> >  
> >  The platform re-enables MMIO to the device (but typically not the
> >  DMA), and then calls the mmio_enabled() callback on all affected
> > @@ -197,8 +204,8 @@ information, if any, and eventually do things like 
> > trigger a device local
> >  reset or some such, but not restart operations. This callback is made if
> >  all drivers on a segment agree that they can try to recover and if no 
> > automatic
> >  link reset was performed by the HW. If the platform can't just re-enable 
> > IOs
> > -without a slot reset or a link reset, it will not call this callback, and
> > -instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot 
> > Reset)
> > +without a slot reset, it will not call this callback, and
> > +instead will have gone directly or STEP 4 (Slot Reset)
> 
> s/or/to/  ?
> 
> >  .. note::
> >  
> > @@ -210,7 +217,7 @@ instead will have gone directly to STEP 3 (Link Reset) 
> > or STEP 4 (Slot Reset)
> > such an error might cause IOs to be re-blocked for the whole
> > segment, and thus invalidate the recovery that other devices
> > on the same segment might have done, forcing the whole segment
> > -   into one of the next states, that is, link reset or slot reset.
> > +   into next states, that is, slot reset.
> 
> s/into next states/into the next state/ ?
> 
> >  The driver should return one of the following result codes:
> >- PCI_ERS_RESULT_RECOVERED
> > @@ -233,17 +240,11 @@ The driver should return one of the following result 
> > codes:
> >  
> >  The next step taken depends on the results returned by the drivers.
> >  If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
> > -proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
> > +proceeds to STEP 5 (Resume Operations).
> >  
> >  If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
> >  proceeds to STEP 4 (Slot Reset)
> >  
> > -STEP 3: Link Reset
> > ---
> > -The platform resets the link.  This is a PCI-Express specific step
> > -and is done whenever a fatal error has been detected that can be
> > -"solved" by resetting the link.
> > -
> >  STEP 4: Slot Reset
> >  --
> >  
> > -- 
> > 2.25.1
> > 


[powerpc:next] BUILD REGRESSION 4ebbbaa4ce8524b853dd6febf0176a6efa3482d7

2021-07-01 Thread kernel test robot
 randconfig-a005-20210630
i386 randconfig-a006-20210630
i386 randconfig-a015-20210701
i386 randconfig-a016-20210701
i386 randconfig-a011-20210701
i386 randconfig-a012-20210701
i386 randconfig-a013-20210701
i386 randconfig-a014-20210701
i386 randconfig-a014-20210630
i386 randconfig-a011-20210630
i386 randconfig-a016-20210630
i386 randconfig-a012-20210630
i386 randconfig-a013-20210630
i386 randconfig-a015-20210630
riscvnommu_k210_defconfig
riscvallyesconfig
riscvnommu_virt_defconfig
riscv allnoconfig
riscv   defconfig
riscv  rv32_defconfig
riscvallmodconfig
um   x86_64_defconfig
um i386_defconfig
umkunit_defconfig
x86_64   allyesconfig
x86_64rhel-8.3-kselftests
x86_64  defconfig
x86_64   rhel-8.3
x86_64  rhel-8.3-kbuiltin
x86_64  kexec

clang tested configs:
x86_64   randconfig-b001-20210630
x86_64   randconfig-b001-20210702
x86_64   randconfig-a004-20210702
x86_64   randconfig-a005-20210702
x86_64   randconfig-a002-20210702
x86_64   randconfig-a006-20210702
x86_64   randconfig-a003-20210702
x86_64   randconfig-a001-20210702
x86_64   randconfig-a012-20210630
x86_64   randconfig-a015-20210630
x86_64   randconfig-a016-20210630
x86_64   randconfig-a013-20210630
x86_64   randconfig-a011-20210630
x86_64   randconfig-a014-20210630

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


[powerpc:merge] BUILD SUCCESS e289c2e239c638cab7e71143e0a65c7c4a057ad7

2021-07-01 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
merge
branch HEAD: e289c2e239c638cab7e71143e0a65c7c4a057ad7  Automatic merge of 
'next' into merge (2021-07-01 23:06)

elapsed time: 729m

configs tested: 116
configs skipped: 3

The following configs have been built successfully.
More configs may be tested in the coming days.

gcc tested configs:
arm defconfig
arm64allyesconfig
arm64   defconfig
arm  allyesconfig
arm  allmodconfig
powerpcsam440ep_defconfig
powerpc powernv_defconfig
powerpc  acadia_defconfig
um  defconfig
sh  kfr2r09_defconfig
mips  bmips_stb_defconfig
powerpc  g5_defconfig
mips   bmips_be_defconfig
arm  ixp4xx_defconfig
armoxnas_v6_defconfig
arm axm55xx_defconfig
powerpc sequoia_defconfig
xtensa   common_defconfig
arm   spear13xx_defconfig
mips bigsur_defconfig
ia64generic_defconfig
arc nsimosci_hs_defconfig
xtensa  nommu_kc705_defconfig
mips loongson2k_defconfig
sh  sdk7780_defconfig
arm assabet_defconfig
m68kmvme147_defconfig
m68km5272c3_defconfig
powerpc mpc837x_rdb_defconfig
x86_64allnoconfig
ia64 allmodconfig
ia64defconfig
ia64 allyesconfig
m68k allmodconfig
m68kdefconfig
m68k allyesconfig
nios2   defconfig
arc  allyesconfig
nds32 allnoconfig
nds32   defconfig
nios2allyesconfig
cskydefconfig
alpha   defconfig
alphaallyesconfig
xtensa   allyesconfig
h8300allyesconfig
arc defconfig
sh   allmodconfig
parisc  defconfig
s390 allyesconfig
s390 allmodconfig
parisc   allyesconfig
s390defconfig
i386 allyesconfig
sparcallyesconfig
sparc   defconfig
i386defconfig
mips allyesconfig
mips allmodconfig
powerpc  allyesconfig
powerpc  allmodconfig
powerpc   allnoconfig
i386 randconfig-a004-20210630
i386 randconfig-a001-20210630
i386 randconfig-a003-20210630
i386 randconfig-a002-20210630
i386 randconfig-a005-20210630
i386 randconfig-a006-20210630
x86_64   randconfig-a002-20210630
x86_64   randconfig-a001-20210630
x86_64   randconfig-a004-20210630
x86_64   randconfig-a005-20210630
x86_64   randconfig-a006-20210630
x86_64   randconfig-a003-20210630
i386 randconfig-a014-20210630
i386 randconfig-a011-20210630
i386 randconfig-a016-20210630
i386 randconfig-a012-20210630
i386 randconfig-a013-20210630
i386 randconfig-a015-20210630
i386 randconfig-a015-20210701
i386 randconfig-a016-20210701
i386 randconfig-a011-20210701
i386 randconfig-a012-20210701
i386 randconfig-a013-20210701
i386 randconfig-a014-20210701
riscvnommu_k210_defconfig
riscvallyesconfig
riscvnommu_virt_defconfig
riscv allnoconfig
riscv   defconfig
riscv  rv32_defconfig
riscvallmodconfig
um   x86_64_defconfig
um i386_defconfig
umkunit_defconfig
x86_64   allyesconfig
x86_64rhel-8.3-kselftests
x86_64  defconfig
x86_64   rhel-8.3
x86_64

Re: [PATCH 1/2] powerpc/bpf: Fix detecting BPF atomic instructions

2021-07-01 Thread Alexei Starovoitov
On Thu, Jul 1, 2021 at 12:32 PM Naveen N. Rao
 wrote:
>
> Alexei Starovoitov wrote:
> > On Thu, Jul 1, 2021 at 8:09 AM Naveen N. Rao
> >  wrote:
> >>
> >> Commit 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other
> >> atomics in .imm") converted BPF_XADD to BPF_ATOMIC and added a way to
> >> distinguish instructions based on the immediate field. Existing JIT
> >> implementations were updated to check for the immediate field and to
> >> reject programs utilizing anything more than BPF_ADD (such as BPF_FETCH)
> >> in the immediate field.
> >>
> >> However, the check added to powerpc64 JIT did not look at the correct
> >> BPF instruction. Due to this, such programs would be accepted and
> >> incorrectly JIT'ed resulting in soft lockups, as seen with the atomic
> >> bounds test. Fix this by looking at the correct immediate value.
> >>
> >> Fixes: 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other 
> >> atomics in .imm")
> >> Reported-by: Jiri Olsa 
> >> Tested-by: Jiri Olsa 
> >> Signed-off-by: Naveen N. Rao 
> >> ---
> >> Hi Jiri,
> >> FYI: I made a small change in this patch -- using 'imm' directly, rather
> >> than insn[i].imm. I've still added your Tested-by since this shouldn't
> >> impact the fix in any way.
> >>
> >> - Naveen
> >
> > Excellent debugging! You guys are awesome.
>
> Thanks. Jiri and Brendan did the bulk of the work :)
>
> > How do you want this fix routed? via bpf tree?
>
> Michael has a few BPF patches queued up in powerpc tree for v5.14, so it
> might be easier to take these patches through the powerpc tree unless he
> feels otherwise. Michael?

Works for me. Thanks!


Re: [PATCH 1/2] powerpc/bpf: Fix detecting BPF atomic instructions

2021-07-01 Thread Naveen N. Rao

Alexei Starovoitov wrote:

On Thu, Jul 1, 2021 at 8:09 AM Naveen N. Rao
 wrote:


Commit 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other
atomics in .imm") converted BPF_XADD to BPF_ATOMIC and added a way to
distinguish instructions based on the immediate field. Existing JIT
implementations were updated to check for the immediate field and to
reject programs utilizing anything more than BPF_ADD (such as BPF_FETCH)
in the immediate field.

However, the check added to powerpc64 JIT did not look at the correct
BPF instruction. Due to this, such programs would be accepted and
incorrectly JIT'ed resulting in soft lockups, as seen with the atomic
bounds test. Fix this by looking at the correct immediate value.

Fixes: 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other atomics in 
.imm")
Reported-by: Jiri Olsa 
Tested-by: Jiri Olsa 
Signed-off-by: Naveen N. Rao 
---
Hi Jiri,
FYI: I made a small change in this patch -- using 'imm' directly, rather
than insn[i].imm. I've still added your Tested-by since this shouldn't
impact the fix in any way.

- Naveen


Excellent debugging! You guys are awesome.


Thanks. Jiri and Brendan did the bulk of the work :)


How do you want this fix routed? via bpf tree?


Michael has a few BPF patches queued up in powerpc tree for v5.14, so it 
might be easier to take these patches through the powerpc tree unless he 
feels otherwise. Michael?


This also needs to be tagged for stable:
Cc: sta...@vger.kernel.org # 5.12+


- Naveen
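
For reference, the kind of immediate-field check being discussed looks roughly like this (a 
simplified sketch of the pattern, not the exact powerpc JIT source; since commit 91c960b0056672 
the atomic operation is encoded in the instruction's imm field):

	case BPF_STX | BPF_ATOMIC | BPF_W:
	case BPF_STX | BPF_ATOMIC | BPF_DW:
		/* 'imm' must come from *this* instruction; reading the
		 * wrong insn's imm let unsupported ops such as
		 * BPF_ADD | BPF_FETCH slip through to the JIT. */
		if (imm != BPF_ADD) {
			pr_err_ratelimited("unsupported atomic op %02x\n", imm);
			return -EOPNOTSUPP;
		}
		break;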


Re: [PATCH 2/2] powerpc/bpf: Reject atomic ops in ppc32 JIT

2021-07-01 Thread Naveen N. Rao

Christophe Leroy wrote:



On 01/07/2021 at 17:08, Naveen N. Rao wrote:

Commit 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other
atomics in .imm") converted BPF_XADD to BPF_ATOMIC and updated all JIT
implementations to reject JIT'ing instructions with an immediate value
different from BPF_ADD. However, ppc32 BPF JIT was implemented around
the same time and didn't include the same change. Update the ppc32 JIT
accordingly.

Signed-off-by: Naveen N. Rao 


Shouldn't it also include a Fixes tag and stable Cc, as PPC32 eBPF was added in
5.13?


Yes, I wasn't sure which patch to actually blame. But you're right, this 
should have the below fixes tag since this affects the ppc32 eBPF JIT.




Fixes: 51c66ad849a7 ("powerpc/bpf: Implement extended BPF on PPC32")
Cc: sta...@vger.kernel.org


Cc: sta...@vger.kernel.org # v5.13


Thanks,
- Naveen



Re: [PATCH v2 3/4] powerpc: wii.dts: Expose the OTP on this platform

2021-07-01 Thread Emmanuel Gil Peyrot
On Sat, Jun 26, 2021 at 11:34:01PM +, Jonathan Neuschäfer wrote:
> On Wed, May 19, 2021 at 11:50:43AM +0200, Emmanuel Gil Peyrot wrote:
> > This can be used by the newly-added nintendo-otp nvmem module.
> > 
> > Signed-off-by: Emmanuel Gil Peyrot 
> > ---
> >  arch/powerpc/boot/dts/wii.dts | 5 +
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/arch/powerpc/boot/dts/wii.dts b/arch/powerpc/boot/dts/wii.dts
> > index aaa381da1906..7837c4a3f09c 100644
> > --- a/arch/powerpc/boot/dts/wii.dts
> > +++ b/arch/powerpc/boot/dts/wii.dts
> > @@ -219,6 +219,11 @@ control@d800100 {
> > reg = <0x0d800100 0x300>;
> > };
> >  
> > +   otp@d8001ec {
> > +   compatible = "nintendo,hollywood-otp";
> > +   reg = <0x0d8001ec 0x8>;
> 
> The OTP registers overlap with the previous node, control@d800100.
> Not sure what's the best way to structure the devicetree in this case,
> maybe something roughly like the following (untested, unverified):
[snip]

I couldn’t get this to work, but additionally it looks like it should
start 0x100 earlier and contain pic1@d800030 and gpio@d8000c0, given
https://wiibrew.org/wiki/Hardware/Hollywood_Registers

Would it make sense, for the time being, to reduce the size of this
control@d800100 device to the single register currently being used by
arch/powerpc/platforms/embedded6xx/wii.c (0xd800194, used to reboot the
system) and leave the refactor of restart + OTP + PIC + GPIO for a
future series?

Thanks,

-- 
Emmanuel Gil Peyrot




Re: [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload

2021-07-01 Thread Nicholas Piggin
Excerpts from Christopher M. Riedl's message of July 1, 2021 3:28 pm:
> On Wed Jun 30, 2021 at 11:15 PM CDT, Nicholas Piggin wrote:
>> Excerpts from Christopher M. Riedl's message of July 1, 2021 1:48 pm:
>> > On Sun Jun 20, 2021 at 10:13 PM CDT, Daniel Axtens wrote:
>> >> "Christopher M. Riedl"  writes:
>> >>
>> >> > Switching to a different mm with Hash translation causes SLB entries to
>> >> > be preloaded from the current thread_info. This reduces SLB faults, for
>> >> > example when threads share a common mm but operate on different address
>> >> > ranges.
>> >> >
>> >> > Preloading entries from the thread_info struct may not always be
>> >> > appropriate - such as when switching to a temporary mm. Introduce a new
>> >> > boolean in mm_context_t to skip the SLB preload entirely. Also move the
>> >> > SLB preload code into a separate function since switch_slb() is already
>> >> > quite long. The default behavior (preloading SLB entries from the
>> >> > current thread_info struct) remains unchanged.
>> >> >
>> >> > Signed-off-by: Christopher M. Riedl 
>> >> >
>> >> > ---
>> >> >
>> >> > v4:  * New to series.
>> >> > ---
>> >> >  arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
>> >> >  arch/powerpc/include/asm/mmu_context.h   | 13 ++
>> >> >  arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
>> >> >  arch/powerpc/mm/book3s64/slb.c   | 56 ++--
>> >> >  4 files changed, 50 insertions(+), 24 deletions(-)
>> >> >
>> >> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
>> >> > b/arch/powerpc/include/asm/book3s/64/mmu.h
>> >> > index eace8c3f7b0a1..b23a9dcdee5af 100644
>> >> > --- a/arch/powerpc/include/asm/book3s/64/mmu.h
>> >> > +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
>> >> > @@ -130,6 +130,9 @@ typedef struct {
>> >> > u32 pkey_allocation_map;
>> >> > s16 execute_only_pkey; /* key holding execute-only protection */
>> >> >  #endif
>> >> > +
>> >> > +   /* Do not preload SLB entries from thread_info during 
>> >> > switch_slb() */
>> >> > +   bool skip_slb_preload;
>> >> >  } mm_context_t;
>> >> >  
>> >> >  static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
>> >> > diff --git a/arch/powerpc/include/asm/mmu_context.h 
>> >> > b/arch/powerpc/include/asm/mmu_context.h
>> >> > index 4bc45d3ed8b0e..264787e90b1a1 100644
>> >> > --- a/arch/powerpc/include/asm/mmu_context.h
>> >> > +++ b/arch/powerpc/include/asm/mmu_context.h
>> >> > @@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct mm_struct 
>> >> > *oldmm,
>> >> > return 0;
>> >> >  }
>> >> >  
>> >> > +#ifdef CONFIG_PPC_BOOK3S_64
>> >> > +
>> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm)
>> >> > +{
>> >> > +   mm->context.skip_slb_preload = true;
>> >> > +}
>> >> > +
>> >> > +#else
>> >> > +
>> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
>> >> > +
>> >> > +#endif /* CONFIG_PPC_BOOK3S_64 */
>> >> > +
>> >> >  #include 
>> >> >  
>> >> >  #endif /* __KERNEL__ */
>> >> > diff --git a/arch/powerpc/mm/book3s64/mmu_context.c 
>> >> > b/arch/powerpc/mm/book3s64/mmu_context.c
>> >> > index c10fc8a72fb37..3479910264c59 100644
>> >> > --- a/arch/powerpc/mm/book3s64/mmu_context.c
>> >> > +++ b/arch/powerpc/mm/book3s64/mmu_context.c
>> >> > @@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, 
>> >> > struct mm_struct *mm)
> >> >> > atomic_set(&mm->context.active_cpus, 0);
> >> >> > atomic_set(&mm->context.copros, 0);
>> >> >  
>> >> > +   mm->context.skip_slb_preload = false;
>> >> > +
>> >> > return 0;
>> >> >  }
>> >> >  
>> >> > diff --git a/arch/powerpc/mm/book3s64/slb.c 
>> >> > b/arch/powerpc/mm/book3s64/slb.c
>> >> > index c91bd85eb90e3..da0836cb855af 100644
>> >> > --- a/arch/powerpc/mm/book3s64/slb.c
>> >> > +++ b/arch/powerpc/mm/book3s64/slb.c
>> >> > @@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int 
>> >> > index)
>> >> > asm volatile("slbie %0" : : "r" (slbie_data));
>> >> >  }
>> >> >  
>> >> > +static void preload_slb_entries(struct task_struct *tsk, struct 
>> >> > mm_struct *mm)
>> >> Should this be explicitly inline or even __always_inline? I'm thinking
>> >> switch_slb is probably a fairly hot path on hash?
>> > 
>> > Yes absolutely. I'll make this change in v5.
>> > 
>> >>
>> >> > +{
>> >> > +   struct thread_info *ti = task_thread_info(tsk);
>> >> > +   unsigned char i;
>> >> > +
>> >> > +   /*
>> >> > +* We gradually age out SLBs after a number of context switches 
>> >> > to
>> >> > +* reduce reload overhead of unused entries (like we do with 
>> >> > FP/VEC
>> >> > +* reload). Each time we wrap 256 switches, take an entry out 
>> >> > of the
>> >> > +* SLB preload cache.
>> >> > +*/
>> >> > +   tsk->thread.load_slb++;
>> >> > +   if (!tsk->thread.load_slb) {
>> >> > +   unsigned long pc = KSTK_EIP(tsk);
>> >> > +
>> >> > +   preload_age(ti);
>> >> > +   

Re: [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching

2021-07-01 Thread Christopher M. Riedl
On Thu Jul 1, 2021 at 1:12 AM CDT, Nicholas Piggin wrote:
> Excerpts from Christopher M. Riedl's message of May 6, 2021 2:34 pm:
> > When code patching a STRICT_KERNEL_RWX kernel the page containing the
> > address to be patched is temporarily mapped as writeable. Currently, a
> > per-cpu vmalloc patch area is used for this purpose. While the patch
> > area is per-cpu, the temporary page mapping is inserted into the kernel
> > page tables for the duration of patching. The mapping is exposed to CPUs
> > other than the patching CPU - this is undesirable from a hardening
> > perspective. Use a temporary mm instead which keeps the mapping local to
> > the CPU doing the patching.
> > 
> > Use the `poking_init` init hook to prepare a temporary mm and patching
> > address. Initialize the temporary mm by copying the init mm. Choose a
> > randomized patching address inside the temporary mm userspace address
> > space. The patching address is randomized between PAGE_SIZE and
> > DEFAULT_MAP_WINDOW-PAGE_SIZE. The upper limit is necessary due to how
> > the Book3s64 Hash MMU operates - by default the space above
> > DEFAULT_MAP_WINDOW is not available. For now, the patching address for
> > all platforms/MMUs is randomized inside this range.  The number of
> > possible random addresses is dependent on PAGE_SIZE and limited by
> > DEFAULT_MAP_WINDOW.
> > 
> > Bits of entropy with 64K page size on BOOK3S_64:
> > 
> > bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
> > 
> > PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
> > bits of entropy = log2(128TB / 64K) bits of entropy = 31
> > 
> > Randomization occurs only once during initialization at boot.
> > 
> > Introduce two new functions, map_patch() and unmap_patch(), to
> > respectively create and remove the temporary mapping with write
> > permissions at patching_addr. The Hash MMU on Book3s64 requires mapping
> > the page for patching with PAGE_SHARED since the kernel cannot access
> > userspace pages with the PAGE_PRIVILEGED (PAGE_KERNEL) bit set.
> > 
> > Also introduce hash_prefault_mapping() to preload the SLB entry and HPTE
> > for the patching_addr when using the Hash MMU on Book3s64 to avoid
> > taking an SLB and Hash fault during patching.
>
> What prevents the SLBE or HPTE from being removed before the last
> access?

This code runs with local IRQs disabled - we also don't access anything
else in userspace so I'm not sure what else could cause the entries to
be removed TBH.

>
>
> > +#ifdef CONFIG_PPC_BOOK3S_64
> > +
> > +static inline int hash_prefault_mapping(pgprot_t pgprot)
> >  {
> > -   struct vm_struct *area;
> > +   int err;
> >  
> > -   area = get_vm_area(PAGE_SIZE, VM_ALLOC);
> > -   if (!area) {
> > -   WARN_ONCE(1, "Failed to create text area for cpu %d\n",
> > -   cpu);
> > -   return -1;
> > -   }
> > -   this_cpu_write(text_poke_area, area);
> > +   if (radix_enabled())
> > +   return 0;
> >  
> > -   return 0;
> > -}
> > +   err = slb_allocate_user(patching_mm, patching_addr);
> > +   if (err)
> > +   pr_warn("map patch: failed to allocate slb entry\n");
> >  
> > -static int text_area_cpu_down(unsigned int cpu)
> > -{
> > -   free_vm_area(this_cpu_read(text_poke_area));
> > -   return 0;
> > +   err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0,
> > +  HPTE_USE_KERNEL_KEY);
> > +   if (err)
> > +   pr_warn("map patch: failed to insert hashed page\n");
> > +
> > +   /* See comment in switch_slb() in mm/book3s64/slb.c */
> > +   isync();
>
> I'm not sure if this is enough. Could we context switch here? You've
> got the PTL so no with a normal kernel but maybe yes with an RT kernel
> How about taking an machine check that clears the SLB? Could the HPTE
> get removed by something else here?

All of this happens after a local_irq_save() which should at least
prevent context switches IIUC. I am not sure what else could cause the
HPTE to get removed here.

>
> You want to prevent faults because you might be patching a fault
> handler?

In a more general sense: I don't think we want to take page faults every
time we patch an instruction with a STRICT_RWX kernel. The Hash MMU page
fault handler codepath also checks `current->mm` in some places which
won't match the temporary mm. Also `current->mm` can be NULL which
caused problems in my earlier revisions of this series.

>
> Thanks,
> Nick



[PATCH 2/2] powerpc/numa: Update cpu_cpu_map on CPU online/offline

2021-07-01 Thread Srikar Dronamraju
cpu_cpu_map holds all the CPUs in the DIE. However, on PowerPC this mask
doesn't get updated when CPUs are onlined/offlined; it is only updated when
CPUs are added/removed. So when both operations - onlining/offlining and
adding/removing of CPUs - are done simultaneously, the cpumasks end up
broken.

WARNING: CPU: 13 PID: 1142 at kernel/sched/topology.c:898
build_sched_domains+0xd48/0x1720
Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag
udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag
bonding tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
rfkill nf_tables nfnetlink pseries_rng xts vmx_crypto uio_pdrv_genirq
uio binfmt_misc ip_tables xfs libcrc32c dm_service_time sd_mod t10_pi sg
ibmvfc scsi_transport_fc ibmveth dm_multipath dm_mirror dm_region_hash
dm_log dm_mod fuse
CPU: 13 PID: 1142 Comm: kworker/13:2 Not tainted 5.13.0-rc6+ #28
Workqueue: events cpuset_hotplug_workfn
NIP:  c01caac8 LR: c01caac4 CTR: 007088ec
REGS: c0005596f220 TRAP: 0700   Not tainted  (5.13.0-rc6+)
MSR:  80029033   CR: 48828222  XER:
0009
CFAR: c01ea698 IRQMASK: 0
GPR00: c01caac4 c0005596f4c0 c1c4a400 0036
GPR04: fffd c0005596f1d0 0027 c018cfd07f90
GPR08: 0023 0001 0027 c018fe68ffe8
GPR12: 8000 c0001e9d1880 c0013a047200 0800
GPR16: c1d3c7d0 0240 0048 c00010aacd18
GPR20: 0001 c00010aacc18 c0013a047c00 c00139ec2400
GPR24: 0280 c00139ec2520 c00136c1b400 c1c93060
GPR28: c0013a047c20 c1d3c6c0 c1c978a0 000d
NIP [c01caac8] build_sched_domains+0xd48/0x1720
LR [c01caac4] build_sched_domains+0xd44/0x1720
Call Trace:
[c0005596f4c0] [c01caac4] build_sched_domains+0xd44/0x1720 
(unreliable)
[c0005596f670] [c01cc5ec] partition_sched_domains_locked+0x3ac/0x4b0
[c0005596f710] [c02804e4] rebuild_sched_domains_locked+0x404/0x9e0
[c0005596f810] [c0283e60] rebuild_sched_domains+0x40/0x70
[c0005596f840] [c0284124] cpuset_hotplug_workfn+0x294/0xf10
[c0005596fc60] [c0175040] process_one_work+0x290/0x590
[c0005596fd00] [c01753c8] worker_thread+0x88/0x620
[c0005596fda0] [c0181704] kthread+0x194/0x1a0
[c0005596fe10] [c000ccec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
485af049 6000 2fa30800 409e0028 80fe e89a00f8 e86100e8 38da0120
7f88e378 7ce53b78 4801fb91 6000 <0fe0> 3900 38e0 38c0

Fix this by updating cpu_cpu_map aka cpumask_of_node() on every CPU
online/offline.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/topology.h | 12 
 arch/powerpc/kernel/smp.c   |  3 +++
 arch/powerpc/mm/numa.c  |  7 ++-
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index e4db64c0e184..2f0a4d7b95f6 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -65,6 +65,11 @@ static inline int early_cpu_to_node(int cpu)
 
 int of_drconf_to_nid_single(struct drmem_lmb *lmb);
 
+extern void map_cpu_to_node(int cpu, int node);
+#ifdef CONFIG_HOTPLUG_CPU
+extern void unmap_cpu_from_node(unsigned long cpu);
+#endif /* CONFIG_HOTPLUG_CPU */
+
 #else
 
 static inline int early_cpu_to_node(int cpu) { return 0; }
@@ -93,6 +98,13 @@ static inline int of_drconf_to_nid_single(struct drmem_lmb 
*lmb)
return first_online_node;
 }
 
+#ifdef CONFIG_SMP
+static inline void map_cpu_to_node(int cpu, int node) {}
+#ifdef CONFIG_HOTPLUG_CPU
+static inline void unmap_cpu_from_node(unsigned long cpu) {}
+#endif /* CONFIG_HOTPLUG_CPU */
+#endif /* CONFIG_SMP */
+
 #endif /* CONFIG_NUMA */
 
 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6c6e4d934d86..e562cca13d66 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1407,6 +1407,8 @@ static void remove_cpu_from_masks(int cpu)
struct cpumask *(*mask_fn)(int) = cpu_sibling_mask;
int i;
 
+   unmap_cpu_from_node(cpu);
+
if (shared_caches)
mask_fn = cpu_l2_cache_mask;
 
@@ -1491,6 +1493,7 @@ static void add_cpu_to_masks(int cpu)
 * This CPU will not be in the online mask yet so we need to manually
 * add it to it's own thread sibling mask.
 

[PATCH 1/2] powerpc/numa: Print debug statements only when required

2021-07-01 Thread Srikar Dronamraju
Currently, a debug message gets printed on every attempt to add (or remove)
a CPU. However this is redundant if the CPU is already added to (or removed
from) the node.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/mm/numa.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 6d0d89127190..f68dbe4e982c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -141,10 +141,11 @@ static void map_cpu_to_node(int cpu, int node)
 {
update_numa_cpu_lookup_table(cpu, node);
 
-   dbg("adding cpu %d to node %d\n", cpu, node);
 
-   if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node])))
+   if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node]))) {
+   dbg("adding cpu %d to node %d\n", cpu, node);
cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
+   }
 }
 
 #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
@@ -152,13 +153,11 @@ static void unmap_cpu_from_node(unsigned long cpu)
 {
int node = numa_cpu_lookup_table[cpu];
 
-   dbg("removing cpu %lu from node %d\n", cpu, node);
-
if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
cpumask_clear_cpu(cpu, node_to_cpumask_map[node]);
+   dbg("removing cpu %lu from node %d\n", cpu, node);
} else {
-   printk(KERN_ERR "WARNING: cpu %lu not found in node %d\n",
-  cpu, node);
+   pr_err("WARNING: cpu %lu not found in node %d\n", cpu, node);
}
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
-- 
2.27.0



Re: [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching

2021-07-01 Thread Nicholas Piggin
Excerpts from Christopher M. Riedl's message of May 6, 2021 2:34 pm:
> When code patching a STRICT_KERNEL_RWX kernel the page containing the
> address to be patched is temporarily mapped as writeable. Currently, a
> per-cpu vmalloc patch area is used for this purpose. While the patch
> area is per-cpu, the temporary page mapping is inserted into the kernel
> page tables for the duration of patching. The mapping is exposed to CPUs
> other than the patching CPU - this is undesirable from a hardening
> perspective. Use a temporary mm instead which keeps the mapping local to
> the CPU doing the patching.
> 
> Use the `poking_init` init hook to prepare a temporary mm and patching
> address. Initialize the temporary mm by copying the init mm. Choose a
> randomized patching address inside the temporary mm userspace address
> space. The patching address is randomized between PAGE_SIZE and
> DEFAULT_MAP_WINDOW-PAGE_SIZE. The upper limit is necessary due to how
> the Book3s64 Hash MMU operates - by default the space above
> DEFAULT_MAP_WINDOW is not available. For now, the patching address for
> all platforms/MMUs is randomized inside this range.  The number of
> possible random addresses is dependent on PAGE_SIZE and limited by
> DEFAULT_MAP_WINDOW.
> 
> Bits of entropy with 64K page size on BOOK3S_64:
> 
> bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
> 
> PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
> bits of entropy = log2(128TB / 64K) bits of entropy = 31
> 
> Randomization occurs only once during initialization at boot.
> 
> Introduce two new functions, map_patch() and unmap_patch(), to
> respectively create and remove the temporary mapping with write
> permissions at patching_addr. The Hash MMU on Book3s64 requires mapping
> the page for patching with PAGE_SHARED since the kernel cannot access
> userspace pages with the PAGE_PRIVILEGED (PAGE_KERNEL) bit set.
> 
> Also introduce hash_prefault_mapping() to preload the SLB entry and HPTE
> for the patching_addr when using the Hash MMU on Book3s64 to avoid
> taking an SLB and Hash fault during patching.

What prevents the SLBE or HPTE from being removed before the last
access?


> +#ifdef CONFIG_PPC_BOOK3S_64
> +
> +static inline int hash_prefault_mapping(pgprot_t pgprot)
>  {
> - struct vm_struct *area;
> + int err;
>  
> - area = get_vm_area(PAGE_SIZE, VM_ALLOC);
> - if (!area) {
> - WARN_ONCE(1, "Failed to create text area for cpu %d\n",
> - cpu);
> - return -1;
> - }
> - this_cpu_write(text_poke_area, area);
> + if (radix_enabled())
> + return 0;
>  
> - return 0;
> -}
> + err = slb_allocate_user(patching_mm, patching_addr);
> + if (err)
> + pr_warn("map patch: failed to allocate slb entry\n");
>  
> -static int text_area_cpu_down(unsigned int cpu)
> -{
> - free_vm_area(this_cpu_read(text_poke_area));
> - return 0;
> + err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0,
> +HPTE_USE_KERNEL_KEY);
> + if (err)
> + pr_warn("map patch: failed to insert hashed page\n");
> +
> + /* See comment in switch_slb() in mm/book3s64/slb.c */
> + isync();

I'm not sure if this is enough. Could we context switch here? You've
got the PTL so no with a normal kernel but maybe yes with an RT kernel
How about taking an machine check that clears the SLB? Could the HPTE
get removed by something else here?

You want to prevent faults because you might be patching a fault 
handler?

Thanks,
Nick


Re: [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload

2021-07-01 Thread Christopher M. Riedl
On Thu Jul 1, 2021 at 1:04 AM CDT, Nicholas Piggin wrote:
> Excerpts from Christopher M. Riedl's message of July 1, 2021 3:28 pm:
> > On Wed Jun 30, 2021 at 11:15 PM CDT, Nicholas Piggin wrote:
> >> Excerpts from Christopher M. Riedl's message of July 1, 2021 1:48 pm:
> >> > On Sun Jun 20, 2021 at 10:13 PM CDT, Daniel Axtens wrote:
> >> >> "Christopher M. Riedl"  writes:
> >> >>
> >> >> > Switching to a different mm with Hash translation causes SLB entries 
> >> >> > to
> >> >> > be preloaded from the current thread_info. This reduces SLB faults, 
> >> >> > for
> >> >> > example when threads share a common mm but operate on different 
> >> >> > address
> >> >> > ranges.
> >> >> >
> >> >> > Preloading entries from the thread_info struct may not always be
> >> >> > appropriate - such as when switching to a temporary mm. Introduce a 
> >> >> > new
> >> >> > boolean in mm_context_t to skip the SLB preload entirely. Also move 
> >> >> > the
> >> >> > SLB preload code into a separate function since switch_slb() is 
> >> >> > already
> >> >> > quite long. The default behavior (preloading SLB entries from the
> >> >> > current thread_info struct) remains unchanged.
> >> >> >
> >> >> > Signed-off-by: Christopher M. Riedl 
> >> >> >
> >> >> > ---
> >> >> >
> >> >> > v4:  * New to series.
> >> >> > ---
> >> >> >  arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
> >> >> >  arch/powerpc/include/asm/mmu_context.h   | 13 ++
> >> >> >  arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
> >> >> >  arch/powerpc/mm/book3s64/slb.c   | 56 
> >> >> > ++--
> >> >> >  4 files changed, 50 insertions(+), 24 deletions(-)
> >> >> >
> >> >> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
> >> >> > b/arch/powerpc/include/asm/book3s/64/mmu.h
> >> >> > index eace8c3f7b0a1..b23a9dcdee5af 100644
> >> >> > --- a/arch/powerpc/include/asm/book3s/64/mmu.h
> >> >> > +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
> >> >> > @@ -130,6 +130,9 @@ typedef struct {
> >> >> >   u32 pkey_allocation_map;
> >> >> >   s16 execute_only_pkey; /* key holding execute-only protection */
> >> >> >  #endif
> >> >> > +
> >> >> > + /* Do not preload SLB entries from thread_info during 
> >> >> > switch_slb() */
> >> >> > + bool skip_slb_preload;
> >> >> >  } mm_context_t;
> >> >> >  
> >> >> >  static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
> >> >> > diff --git a/arch/powerpc/include/asm/mmu_context.h 
> >> >> > b/arch/powerpc/include/asm/mmu_context.h
> >> >> > index 4bc45d3ed8b0e..264787e90b1a1 100644
> >> >> > --- a/arch/powerpc/include/asm/mmu_context.h
> >> >> > +++ b/arch/powerpc/include/asm/mmu_context.h
> >> >> > @@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct mm_struct 
> >> >> > *oldmm,
> >> >> >   return 0;
> >> >> >  }
> >> >> >  
> >> >> > +#ifdef CONFIG_PPC_BOOK3S_64
> >> >> > +
> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm)
> >> >> > +{
> >> >> > + mm->context.skip_slb_preload = true;
> >> >> > +}
> >> >> > +
> >> >> > +#else
> >> >> > +
> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
> >> >> > +
> >> >> > +#endif /* CONFIG_PPC_BOOK3S_64 */
> >> >> > +
> >> >> >  #include 
> >> >> >  
> >> >> >  #endif /* __KERNEL__ */
> >> >> > diff --git a/arch/powerpc/mm/book3s64/mmu_context.c 
> >> >> > b/arch/powerpc/mm/book3s64/mmu_context.c
> >> >> > index c10fc8a72fb37..3479910264c59 100644
> >> >> > --- a/arch/powerpc/mm/book3s64/mmu_context.c
> >> >> > +++ b/arch/powerpc/mm/book3s64/mmu_context.c
> >> >> > @@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, 
> >> >> > struct mm_struct *mm)
> >> >> >   atomic_set(&mm->context.active_cpus, 0);
> >> >> >   atomic_set(&mm->context.copros, 0);
> >> >> >  
> >> >> > + mm->context.skip_slb_preload = false;
> >> >> > +
> >> >> >   return 0;
> >> >> >  }
> >> >> >  
> >> >> > diff --git a/arch/powerpc/mm/book3s64/slb.c 
> >> >> > b/arch/powerpc/mm/book3s64/slb.c
> >> >> > index c91bd85eb90e3..da0836cb855af 100644
> >> >> > --- a/arch/powerpc/mm/book3s64/slb.c
> >> >> > +++ b/arch/powerpc/mm/book3s64/slb.c
> >> >> > @@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int 
> >> >> > index)
> >> >> >   asm volatile("slbie %0" : : "r" (slbie_data));
> >> >> >  }
> >> >> >  
> >> >> > +static void preload_slb_entries(struct task_struct *tsk, struct 
> >> >> > mm_struct *mm)
> >> >> Should this be explicitly inline or even __always_inline? I'm thinking
> >> >> switch_slb is probably a fairly hot path on hash?
> >> > 
> >> > Yes absolutely. I'll make this change in v5.
> >> > 
> >> >>
> >> >> > +{
> >> >> > + struct thread_info *ti = task_thread_info(tsk);
> >> >> > + unsigned char i;
> >> >> > +
> >> >> > + /*
> >> >> > +  * We gradually age out SLBs after a number of context switches 
> >> >> > to
> >> >> > +  * reduce reload overhead of unused entries (like we do with 
> >> >> > FP/VEC
> >> >> > +  * reload). Each time we wrap 

Re: [PATCH v15 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-07-01 Thread Will Deacon
On Wed, Jun 30, 2021 at 08:56:51AM -0700, Nathan Chancellor wrote:
> On Wed, Jun 30, 2021 at 12:43:48PM +0100, Will Deacon wrote:
> > On Wed, Jun 30, 2021 at 05:17:27PM +0800, Claire Chang wrote:
> > > `BUG: unable to handle page fault for address: 003a8290` and
> > > the fact it crashed at `_raw_spin_lock_irqsave` look like the memory
> > > (maybe dev->dma_io_tlb_mem) was corrupted?
> > > The dev->dma_io_tlb_mem should be set here
> > > (https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/drivers/pci/probe.c#n2528)
> > > through device_initialize.
> > 
> > I'm less sure about this. 'dma_io_tlb_mem' should be pointing at
> > 'io_tlb_default_mem', which is a page-aligned allocation from memblock.
> > The spinlock is at offset 0x24 in that structure, and looking at the
> > register dump from the crash:
> > 
> > Jun 29 18:28:42 hp-4300G kernel: RSP: 0018:adb4013db9e8 EFLAGS: 00010006
> > Jun 29 18:28:42 hp-4300G kernel: RAX: 003a8290 RBX: 
> >  RCX: 8900572ad580
> > Jun 29 18:28:42 hp-4300G kernel: RDX: 89005653f024 RSI: 
> > 000c RDI: 1d17
> > Jun 29 18:28:42 hp-4300G kernel: RBP: 0a20d000 R08: 
> > 000c R09: 
> > Jun 29 18:28:42 hp-4300G kernel: R10: 0a20d000 R11: 
> > 89005653f000 R12: 0212
> > Jun 29 18:28:42 hp-4300G kernel: R13: 1000 R14: 
> > 0002 R15: 0020
> > Jun 29 18:28:42 hp-4300G kernel: FS:  7f1f8898ea40() 
> > GS:89005728() knlGS:
> > Jun 29 18:28:42 hp-4300G kernel: CS:  0010 DS:  ES:  CR0: 
> > 80050033
> > Jun 29 18:28:42 hp-4300G kernel: CR2: 003a8290 CR3: 
> > 0001020d CR4: 00350ee0
> > Jun 29 18:28:42 hp-4300G kernel: Call Trace:
> > Jun 29 18:28:42 hp-4300G kernel:  _raw_spin_lock_irqsave+0x39/0x50
> > Jun 29 18:28:42 hp-4300G kernel:  swiotlb_tbl_map_single+0x12b/0x4c0
> > 
> > Then that correlates with R11 holding the 'dma_io_tlb_mem' pointer and
> > RDX pointing at the spinlock. Yet RAX is holding junk :/
> > 
> > I agree that enabling KASAN would be a good idea, but I also think we
> > probably need to get some more information out of swiotlb_tbl_map_single()
> > to see see what exactly is going wrong in there.
> 
> I can certainly enable KASAN and if there is any debug print I can add
> or dump anything, let me know!

I bit the bullet and took v5.13 with swiotlb/for-linus-5.14 merged in, built
x86 defconfig and ran it on my laptop. However, it seems to work fine!

Please can you share your .config?

Will


Re: [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload

2021-07-01 Thread Nicholas Piggin
Excerpts from Christopher M. Riedl's message of July 1, 2021 4:53 pm:
> On Thu Jul 1, 2021 at 1:04 AM CDT, Nicholas Piggin wrote:
>> Excerpts from Christopher M. Riedl's message of July 1, 2021 3:28 pm:
>> > On Wed Jun 30, 2021 at 11:15 PM CDT, Nicholas Piggin wrote:
>> >> Excerpts from Christopher M. Riedl's message of July 1, 2021 1:48 pm:
>> >> > On Sun Jun 20, 2021 at 10:13 PM CDT, Daniel Axtens wrote:
>> >> >> "Christopher M. Riedl"  writes:
>> >> >>
>> >> >> > Switching to a different mm with Hash translation causes SLB entries 
>> >> >> > to
>> >> >> > be preloaded from the current thread_info. This reduces SLB faults, 
>> >> >> > for
>> >> >> > example when threads share a common mm but operate on different 
>> >> >> > address
>> >> >> > ranges.
>> >> >> >
>> >> >> > Preloading entries from the thread_info struct may not always be
>> >> >> > appropriate - such as when switching to a temporary mm. Introduce a 
>> >> >> > new
>> >> >> > boolean in mm_context_t to skip the SLB preload entirely. Also move 
>> >> >> > the
>> >> >> > SLB preload code into a separate function since switch_slb() is 
>> >> >> > already
>> >> >> > quite long. The default behavior (preloading SLB entries from the
>> >> >> > current thread_info struct) remains unchanged.
>> >> >> >
>> >> >> > Signed-off-by: Christopher M. Riedl 
>> >> >> >
>> >> >> > ---
>> >> >> >
>> >> >> > v4:  * New to series.
>> >> >> > ---
>> >> >> >  arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
>> >> >> >  arch/powerpc/include/asm/mmu_context.h   | 13 ++
>> >> >> >  arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
>> >> >> >  arch/powerpc/mm/book3s64/slb.c   | 56 
>> >> >> > ++--
>> >> >> >  4 files changed, 50 insertions(+), 24 deletions(-)
>> >> >> >
>> >> >> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
>> >> >> > b/arch/powerpc/include/asm/book3s/64/mmu.h
>> >> >> > index eace8c3f7b0a1..b23a9dcdee5af 100644
>> >> >> > --- a/arch/powerpc/include/asm/book3s/64/mmu.h
>> >> >> > +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
>> >> >> > @@ -130,6 +130,9 @@ typedef struct {
>> >> >> >  u32 pkey_allocation_map;
>> >> >> >  s16 execute_only_pkey; /* key holding execute-only protection */
>> >> >> >  #endif
>> >> >> > +
>> >> >> > +/* Do not preload SLB entries from thread_info during 
>> >> >> > switch_slb() */
>> >> >> > +bool skip_slb_preload;
>> >> >> >  } mm_context_t;
>> >> >> >  
>> >> >> >  static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
>> >> >> > diff --git a/arch/powerpc/include/asm/mmu_context.h 
>> >> >> > b/arch/powerpc/include/asm/mmu_context.h
>> >> >> > index 4bc45d3ed8b0e..264787e90b1a1 100644
>> >> >> > --- a/arch/powerpc/include/asm/mmu_context.h
>> >> >> > +++ b/arch/powerpc/include/asm/mmu_context.h
>> >> >> > @@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct 
>> >> >> > mm_struct *oldmm,
>> >> >> >  return 0;
>> >> >> >  }
>> >> >> >  
>> >> >> > +#ifdef CONFIG_PPC_BOOK3S_64
>> >> >> > +
>> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm)
>> >> >> > +{
>> >> >> > +mm->context.skip_slb_preload = true;
>> >> >> > +}
>> >> >> > +
>> >> >> > +#else
>> >> >> > +
>> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
>> >> >> > +
>> >> >> > +#endif /* CONFIG_PPC_BOOK3S_64 */
>> >> >> > +
>> >> >> >  #include 
>> >> >> >  
>> >> >> >  #endif /* __KERNEL__ */
>> >> >> > diff --git a/arch/powerpc/mm/book3s64/mmu_context.c 
>> >> >> > b/arch/powerpc/mm/book3s64/mmu_context.c
>> >> >> > index c10fc8a72fb37..3479910264c59 100644
>> >> >> > --- a/arch/powerpc/mm/book3s64/mmu_context.c
>> >> >> > +++ b/arch/powerpc/mm/book3s64/mmu_context.c
>> >> >> > @@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, 
>> >> >> > struct mm_struct *mm)
>> >> >> >  atomic_set(&mm->context.active_cpus, 0);
>> >> >> >  atomic_set(&mm->context.copros, 0);
>> >> >> >  
>> >> >> > +mm->context.skip_slb_preload = false;
>> >> >> > +
>> >> >> >  return 0;
>> >> >> >  }
>> >> >> >  
>> >> >> > diff --git a/arch/powerpc/mm/book3s64/slb.c 
>> >> >> > b/arch/powerpc/mm/book3s64/slb.c
>> >> >> > index c91bd85eb90e3..da0836cb855af 100644
>> >> >> > --- a/arch/powerpc/mm/book3s64/slb.c
>> >> >> > +++ b/arch/powerpc/mm/book3s64/slb.c
>> >> >> > @@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int 
>> >> >> > index)
>> >> >> >  asm volatile("slbie %0" : : "r" (slbie_data));
>> >> >> >  }
>> >> >> >  
>> >> >> > +static void preload_slb_entries(struct task_struct *tsk, struct 
>> >> >> > mm_struct *mm)
>> >> >> Should this be explicitly inline or even __always_inline? I'm thinking
>> >> >> switch_slb is probably a fairly hot path on hash?
>> >> > 
>> >> > Yes absolutely. I'll make this change in v5.
>> >> > 
>> >> >>
>> >> >> > +{
>> >> >> > +struct thread_info *ti = task_thread_info(tsk);
>> >> >> > +unsigned char i;
>> >> >> > +
>> >> >> > +/*
>> >> >> > + * We gradually age out SLBs after a number 

Re: [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching

2021-07-01 Thread Nicholas Piggin
Excerpts from Christopher M. Riedl's message of July 1, 2021 5:02 pm:
> On Thu Jul 1, 2021 at 1:12 AM CDT, Nicholas Piggin wrote:
>> Excerpts from Christopher M. Riedl's message of May 6, 2021 2:34 pm:
>> > When code patching a STRICT_KERNEL_RWX kernel the page containing the
>> > address to be patched is temporarily mapped as writeable. Currently, a
>> > per-cpu vmalloc patch area is used for this purpose. While the patch
>> > area is per-cpu, the temporary page mapping is inserted into the kernel
>> > page tables for the duration of patching. The mapping is exposed to CPUs
>> > other than the patching CPU - this is undesirable from a hardening
>> > perspective. Use a temporary mm instead which keeps the mapping local to
>> > the CPU doing the patching.
>> > 
>> > Use the `poking_init` init hook to prepare a temporary mm and patching
>> > address. Initialize the temporary mm by copying the init mm. Choose a
>> > randomized patching address inside the temporary mm userspace address
>> > space. The patching address is randomized between PAGE_SIZE and
>> > DEFAULT_MAP_WINDOW-PAGE_SIZE. The upper limit is necessary due to how
>> > the Book3s64 Hash MMU operates - by default the space above
>> > DEFAULT_MAP_WINDOW is not available. For now, the patching address for
>> > all platforms/MMUs is randomized inside this range.  The number of
>> > possible random addresses is dependent on PAGE_SIZE and limited by
>> > DEFAULT_MAP_WINDOW.
>> > 
>> > Bits of entropy with 64K page size on BOOK3S_64:
>> > 
>> > bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
>> > 
>> > PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
>> > bits of entropy = log2(128TB / 64K) bits of entropy = 31
>> > 
>> > Randomization occurs only once during initialization at boot.
>> > 
>> > Introduce two new functions, map_patch() and unmap_patch(), to
>> > respectively create and remove the temporary mapping with write
>> > permissions at patching_addr. The Hash MMU on Book3s64 requires mapping
>> > the page for patching with PAGE_SHARED since the kernel cannot access
>> > userspace pages with the PAGE_PRIVILEGED (PAGE_KERNEL) bit set.
>> > 
>> > Also introduce hash_prefault_mapping() to preload the SLB entry and HPTE
>> > for the patching_addr when using the Hash MMU on Book3s64 to avoid
>> > taking an SLB and Hash fault during patching.
>>
>> What prevents the SLBE or HPTE from being removed before the last
>> access?
> 
> This code runs with local IRQs disabled - we also don't access anything
> else in userspace so I'm not sure what else could cause the entries to
> be removed TBH.
> 
>>
>>
>> > +#ifdef CONFIG_PPC_BOOK3S_64
>> > +
>> > +static inline int hash_prefault_mapping(pgprot_t pgprot)
>> >  {
>> > -  struct vm_struct *area;
>> > +  int err;
>> >  
>> > -  area = get_vm_area(PAGE_SIZE, VM_ALLOC);
>> > -  if (!area) {
>> > -  WARN_ONCE(1, "Failed to create text area for cpu %d\n",
>> > -  cpu);
>> > -  return -1;
>> > -  }
>> > -  this_cpu_write(text_poke_area, area);
>> > +  if (radix_enabled())
>> > +  return 0;
>> >  
>> > -  return 0;
>> > -}
>> > +  err = slb_allocate_user(patching_mm, patching_addr);
>> > +  if (err)
>> > +  pr_warn("map patch: failed to allocate slb entry\n");
>> >  
>> > -static int text_area_cpu_down(unsigned int cpu)
>> > -{
>> > -  free_vm_area(this_cpu_read(text_poke_area));
>> > -  return 0;
>> > +  err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0,
>> > + HPTE_USE_KERNEL_KEY);
>> > +  if (err)
>> > +  pr_warn("map patch: failed to insert hashed page\n");
>> > +
>> > +  /* See comment in switch_slb() in mm/book3s64/slb.c */
>> > +  isync();
>>
>> I'm not sure if this is enough. Could we context switch here? You've
>> got the PTL so no with a normal kernel, but maybe yes with an RT kernel.
>> How about taking a machine check that clears the SLB? Could the HPTE
>> get removed by something else here?
> 
> All of this happens after a local_irq_save() which should at least
> prevent context switches IIUC.

Ah yeah I didn't look that far back. A machine check can take out SLB
entries.

> I am not sure what else could cause the
> HPTE to get removed here.

Other CPUs?

>> You want to prevent faults because you might be patching a fault
>> handler?
> 
> In a more general sense: I don't think we want to take page faults every
> time we patch an instruction with a STRICT_RWX kernel. The Hash MMU page
> fault handler codepath also checks `current->mm` in some places which
> won't match the temporary mm. Also `current->mm` can be NULL which
> caused problems in my earlier revisions of this series.

Hmm, that's a bit of a hack then. Maybe doing an actual mm switch and 
setting current->mm properly would explode too much. Maybe that's okayish.
But I can't see how the HPT code is up to the job of this in general 
(even if that current->mm issue was fixed).

To do it without holes you would either 

Re: [PATCH v5] pseries: prevent free CPU ids to be reused on another node

2021-07-01 Thread Laurent Dufour

Hi Michael,

Do you mind taking this patch for 5.14?

Thanks,
Laurent.

Le 29/04/2021 à 19:49, Laurent Dufour a écrit :

When a CPU is hot added, the CPU ids are taken from the available mask,
starting with the lowest possible set. If those values were previously used
for CPUs attached to a different node, it looks to applications as if these
CPUs had migrated from one node to another, which is not expected in real
life.

To prevent this, record the CPU ids used for each node and do not reuse
them on another node. However, to prevent CPU hot plug from failing when a
node runs out of CPU ids, the capability to reuse other nodes' free CPU
ids is kept. A warning is displayed in such a case to alert the user.

A new CPU bit mask (node_recorded_ids_map) is introduced for each possible
node. It is populated with the CPUs onlined at boot time, and then when a
CPU is hot plugged to a node. The bits in that mask remain set when the CPU
is hot unplugged, to record that these CPU ids have been used for this node.

If no id set was found, a retry is made without removing the ids used on
the other nodes to try reusing them. This is the way ids have been
allocated prior to this patch.
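
The allocation thus becomes a two-pass search; roughly (a sketch of the
logic described above, not the patch itself):

	/* First pass: only ids never recorded on another node. */
	rc = find_cpu_id_range(nthreads, assigned_node, &cpu_mask);
	if (rc && nr_node_ids > 1) {
		/* Fallback: allow ids already recorded on other nodes. */
		rc = find_cpu_id_range(nthreads, NUMA_NO_NODE, &cpu_mask);
		if (!rc)
			pr_warn("CPU ids exhausted on node %d, reusing other nodes' free ids\n",
				assigned_node);
	}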

The effect of this patch can be seen by removing and adding CPUs using the
Qemu monitor. In the following case, the first CPU from the node 2 is
removed, then the first one from the node 1 is removed too. Later, the
first CPU of the node 2 is added back. Without that patch, the kernel will
number these CPUs using the first available CPU ids, which are the ones
freed when removing the second CPU of the node 0. This leads the CPU ids
16-23 to move from the node 1 to the node 2. With the patch applied, the
CPU ids 32-39 are used since they are the lowest free ones which have not
been used on another node.

At boot time:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Vanilla kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 16 17 18 19 20 21 22 23 40 41 42 43 44 45 46 47

Patched kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Signed-off-by: Laurent Dufour 
---
V5:
  - Rework code structure
  - Reintroduce the capability to reuse other node's ids.
V4: addressing Nathan's comment
  - Rename the local variable named 'nid' into 'assigned_node'
V3: addressing Nathan's comments
  - Remove the retry feature
  - Reduce the number of local variables (removing 'i')
  - Add comment about the cpu_add_remove_lock protecting the added CPU mask.
  V2: (no functional changes)
  - update the test's output in the commit's description
  - node_recorded_ids_map should be static
---
  arch/powerpc/platforms/pseries/hotplug-cpu.c | 171 ++-
  1 file changed, 132 insertions(+), 39 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 7e970f81d8ff..e1f224320102 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -39,6 +39,12 @@
  /* This version can't take the spinlock, because it never returns */
  static int rtas_stop_self_token = RTAS_UNKNOWN_SERVICE;
  
+/*

+ * Record the CPU ids used on each node.
+ * Protected by cpu_add_remove_lock.
+ */
+static cpumask_var_t node_recorded_ids_map[MAX_NUMNODES];
+
  static void rtas_stop_self(void)
  {
static struct rtas_args args;
@@ -139,72 +145,148 @@ static void pseries_cpu_die(unsigned int cpu)
paca_ptrs[cpu]->cpu_start = 0;
  }
  
+/**

+ * find_cpu_id_range - find a linear range of @nthreads free CPU ids.
+ * @nthreads : the number of threads (cpu ids)
+ * @assigned_node : the node it belongs to or NUMA_NO_NODE if free ids from any
+ *  node can be picked.
+ * @cpu_mask: the returned CPU mask.
+ *
+ * Returns 0 on success.
+ */
+static int find_cpu_id_range(unsigned int nthreads, int assigned_node,
+cpumask_var_t *cpu_mask)
+{
+   cpumask_var_t candidate_mask;
+   unsigned int cpu, node;
+   int rc = -ENOSPC;
+
+   if (!zalloc_cpumask_var(&candidate_mask, GFP_KERNEL))
+   return -ENOMEM;
+
+   cpumask_clear(*cpu_mask);
+   for (cpu = 0; cpu < nthreads; cpu++)
+   cpumask_set_cpu(cpu, *cpu_mask);
+
+   BUG_ON(!cpumask_subset(cpu_present_mask, cpu_possible_mask));
+
+   /* Get a bitmap of unoccupied slots. */
+   cpumask_xor(candidate_mask, cpu_possible_mask, cpu_present_mask);
+
+   if (assigned_node != NUMA_NO_NODE) {
+   /*
+* Remove 

Re: [PATCH v2] ppc64/numa: consider the max numa node for migratable LPAR

2021-07-01 Thread Laurent Dufour

Hi Michael,

Do you mind taking this patch for 5.14?

Thanks,
Laurent.

Le 11/05/2021 à 09:31, Laurent Dufour a écrit :

When a LPAR is migratable, we should consider the maximum possible NUMA
node instead of the number of NUMA nodes on the actual system.

The DT property 'ibm,current-associativity-domains' defines the maximum
number of nodes the LPAR can see when running on that box. But if the LPAR
is migrated to another box, it may see up to the number of nodes defined by
'ibm,max-associativity-domains'. So if a LPAR is migratable, that value
should be used.

Unfortunately, there is no easy way to know if a LPAR is migratable or
not. The hypervisor exports the property 'ibm,migratable-partition' when
it is set up to migrate partitions, but that does not mean that the
current partition is migratable.

Without this patch, when a LPAR is started on a 2-node box and then
migrated to a 3-node box, the hypervisor may spread the LPAR's CPUs on the
3rd node. In that case if a CPU from that 3rd node is added to the LPAR, it
will be wrongly assigned to the node because the kernel has been set to use
up to 2 nodes (the configuration of the departure node). With this patch
applied, the CPU is correctly added to the 3rd node.

Fixes: f9f130ff2ec9 ("powerpc/numa: Detect support for coregroup")
Reviewed-by: Srikar Dronamraju 
Signed-off-by: Laurent Dufour 
---
V2: Address Srikar's comments
  - Fix the commit message
  - Use pr_info instead printk(KERN_INFO..)
---
  arch/powerpc/mm/numa.c | 13 ++---
  1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..094a1076fd1f 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -893,7 +893,7 @@ static void __init setup_node_data(int nid, u64 start_pfn, 
u64 end_pfn)
  static void __init find_possible_nodes(void)
  {
struct device_node *rtas;
-   const __be32 *domains;
+   const __be32 *domains = NULL;
int prop_length, max_nodes;
u32 i;
  
@@ -909,9 +909,14 @@ static void __init find_possible_nodes(void)

 * it doesn't exist, then fallback on ibm,max-associativity-domains.
 * Current denotes what the platform can support compared to max
 * which denotes what the Hypervisor can support.
+*
+* If the LPAR is migratable, new nodes might be activated after a LPM,
+* so we should consider the max number in that case.
 */
-   domains = of_get_property(rtas, "ibm,current-associativity-domains",
-   &prop_length);
+   if (!of_get_property(of_root, "ibm,migratable-partition", NULL))
+   domains = of_get_property(rtas,
+ "ibm,current-associativity-domains",
+ &prop_length);
if (!domains) {
domains = of_get_property(rtas, "ibm,max-associativity-domains",
&prop_length);
@@ -920,6 +925,8 @@ static void __init find_possible_nodes(void)
}
  
max_nodes = of_read_number(&domains[min_common_depth], 1);

+   pr_info("Partition configured for %d NUMA nodes.\n", max_nodes);
+
for (i = 0; i < max_nodes; i++) {
if (!node_possible(i))
node_set(i, node_possible_map);





Re: [PATCH v5] pseries/drmem: update LMBs after LPM

2021-07-01 Thread Laurent Dufour

Hi Michael,

Do you mind taking this patch for 5.14?

Thanks,
Laurent.

Le 17/05/2021 à 11:06, Laurent Dufour a écrit :

After a LPM, the device tree node ibm,dynamic-reconfiguration-memory may be
updated by the hypervisor when the NUMA topology of the LPAR's
memory is updated.

This is handled by the kernel, but the memory's node is not updated because
there is no way to move a memory block between nodes from the Linux kernel
point of view.

If later a memory block is added or removed, drmem_update_dt() is called
and it overwrites the DT node ibm,dynamic-reconfiguration-memory to
match the added or removed LMB. But the LMB's associativity node has not
been updated after the DT node update, and thus the node is overwritten by
Linux's topology instead of the hypervisor's.

Introduce a hook called when the ibm,dynamic-reconfiguration-memory node is
updated to force an update of the LMB's associativity. However, ignore the
call to that hook when the update has been triggered by drmem_update_dt().
Because, in that case, the LMB tree has been used to set the DT property
and thus it doesn't need to be updated back. Since drmem_update_dt() is
called under the protection of the device_hotplug_lock and the hook is
called in the same context, use a simple boolean variable to detect that
call.

Cc: Nathan Lynch 
Cc: Aneesh Kumar K.V 
Cc: Tyrel Datwyler 
Signed-off-by: Laurent Dufour 
---

V5:
  - Reword the commit's description to address Nathan's comments.
V4:
  - Prevent the LMB to be updated back in the case the request came from the
  LMB tree's update.
V3:
  - Check rd->dn->name instead of rd->dn->full_name
V2:
  - Take Tyrel's idea to rely on OF_RECONFIG_UPDATE_PROPERTY instead of
  introducing a new hook mechanism.
---
  arch/powerpc/include/asm/drmem.h  |  1 +
  arch/powerpc/mm/drmem.c   | 46 +++
  .../platforms/pseries/hotplug-memory.c|  4 ++
  3 files changed, 51 insertions(+)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index bf2402fed3e0..4265d5e95c2c 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -111,6 +111,7 @@ int drmem_update_dt(void);
  int __init
  walk_drmem_lmbs_early(unsigned long node, void *data,
  int (*func)(struct drmem_lmb *, const __be32 **, void *));
+void drmem_update_lmbs(struct property *prop);
  #endif
  
  static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb)

diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 9af3832c9d8d..22197b18d85e 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -18,6 +18,7 @@ static int n_root_addr_cells, n_root_size_cells;
  
  static struct drmem_lmb_info __drmem_info;

  struct drmem_lmb_info *drmem_info = &__drmem_info;
+static bool in_drmem_update;
  
  u64 drmem_lmb_memory_max(void)

  {
@@ -178,6 +179,11 @@ int drmem_update_dt(void)
if (!memory)
return -1;
  
+	/*

+* Set in_drmem_update to prevent the notifier callback from processing
+* the DT property again, since the change is coming from the LMB tree.
+*/
+   in_drmem_update = true;
prop = of_find_property(memory, "ibm,dynamic-memory", NULL);
if (prop) {
rc = drmem_update_dt_v1(memory, prop);
@@ -186,6 +192,7 @@ int drmem_update_dt(void)
if (prop)
rc = drmem_update_dt_v2(memory, prop);
}
+   in_drmem_update = false;
  
  	of_node_put(memory);

return rc;
@@ -307,6 +314,45 @@ int __init walk_drmem_lmbs_early(unsigned long node, void 
*data,
return ret;
  }
  
+/*

+ * Update the LMB associativity index.
+ */
+static int update_lmb(struct drmem_lmb *updated_lmb,
+ __maybe_unused const __be32 **usm,
+ __maybe_unused void *data)
+{
+   struct drmem_lmb *lmb;
+
+   for_each_drmem_lmb(lmb) {
+   if (lmb->drc_index != updated_lmb->drc_index)
+   continue;
+
+   lmb->aa_index = updated_lmb->aa_index;
+   break;
+   }
+   return 0;
+}
+
+/*
+ * Update the LMB associativity index.
+ *
+ * This needs to be called when the hypervisor is updating the
+ * dynamic-reconfiguration-memory node property.
+ */
+void drmem_update_lmbs(struct property *prop)
+{
+   /*
+* Don't update the LMBs if triggered by the update done in
+* drmem_update_dt(), the LMB values have been used to update the DT
+* property in that case.
+*/
+   if (in_drmem_update)
+   return;
+   if (!strcmp(prop->name, "ibm,dynamic-memory"))
+   __walk_drmem_v1_lmbs(prop->value, NULL, NULL, update_lmb);
+   else if (!strcmp(prop->name, "ibm,dynamic-memory-v2"))
+   __walk_drmem_v2_lmbs(prop->value, NULL, NULL, update_lmb);
+}
  #endif
  
  static int init_drmem_lmb_size(struct 

Re: [RFC PATCH 38/43] KVM: PPC: Book3S HV P9: Test dawr_enabled() before saving host DAWR SPRs

2021-07-01 Thread Nicholas Piggin
Excerpts from Fabiano Rosas's message of July 1, 2021 3:51 am:
> Nicholas Piggin  writes:
> 
>> Some of the DAWR SPR access is already predicated on dawr_enabled(),
>> apply this to the remainder of the accesses.
>>
>> Signed-off-by: Nicholas Piggin 
>> ---
>>  arch/powerpc/kvm/book3s_hv_p9_entry.c | 34 ---
>>  1 file changed, 20 insertions(+), 14 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
>> b/arch/powerpc/kvm/book3s_hv_p9_entry.c
>> index 7aa72efcac6c..f305d1d6445c 100644
>> --- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
>> +++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
>> @@ -638,13 +638,16 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
>> time_limit, unsigned long lpc
>>
>>  host_hfscr = mfspr(SPRN_HFSCR);
>>  host_ciabr = mfspr(SPRN_CIABR);
>> -host_dawr0 = mfspr(SPRN_DAWR0);
>> -host_dawrx0 = mfspr(SPRN_DAWRX0);
>>  host_psscr = mfspr(SPRN_PSSCR);
>>  host_pidr = mfspr(SPRN_PID);
>> -if (cpu_has_feature(CPU_FTR_DAWR1)) {
>> -host_dawr1 = mfspr(SPRN_DAWR1);
>> -host_dawrx1 = mfspr(SPRN_DAWRX1);
>> +
>> +if (dawr_enabled()) {
>> +host_dawr0 = mfspr(SPRN_DAWR0);
>> +host_dawrx0 = mfspr(SPRN_DAWRX0);
>> +if (cpu_has_feature(CPU_FTR_DAWR1)) {
>> +host_dawr1 = mfspr(SPRN_DAWR1);
>> +host_dawrx1 = mfspr(SPRN_DAWRX1);
> 
> The userspace needs to enable DAWR1 via KVM_CAP_PPC_DAWR1. That cap is
> not even implemented in QEMU currently, so we never allow the guest to
> set vcpu->arch.dawr1. If we check for kvm->arch.dawr1_enabled instead of
> the CPU feature, we could shave some more time here.

Ah good point, yes let's do that.
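
Something like this, perhaps (just a sketch; it assumes the
kvm->arch.dawr1_enabled flag that KVM_CAP_PPC_DAWR1 sets, which this
series does not touch):

	if (dawr_enabled()) {
		host_dawr0 = mfspr(SPRN_DAWR0);
		host_dawrx0 = mfspr(SPRN_DAWRX0);
		/* gate on the per-VM capability, not the CPU feature */
		if (vcpu->kvm->arch.dawr1_enabled) {
			host_dawr1 = mfspr(SPRN_DAWR1);
			host_dawrx1 = mfspr(SPRN_DAWRX1);
		}
	}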

Thanks,
Nick


[PATCH 0/2] Update cpu_cpu_mask on CPU online/offline

2021-07-01 Thread Srikar Dronamraju
When simultaneously running CPU online/offline with CPU add/remove in a
loop, we see WARNING messages.

WARNING: CPU: 13 PID: 1142 at kernel/sched/topology.c:898 
build_sched_domains+0xd48/0x1720
Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag udp_diag
raw_diag inet_diag unix_diag af_packet_diag netlink_diag bonding tls
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink
pseries_rng xts vmx_crypto uio_pdrv_genirq uio binfmt_misc ip_tables xfs
libcrc32c dm_service_time sd_mod t10_pi sg ibmvfc scsi_transport_fc ibmveth
dm_multipath dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 13 PID: 1142 Comm: kworker/13:2 Not tainted 5.13.0-rc6+ #28
Workqueue: events cpuset_hotplug_workfn
NIP:  c01caac8 LR: c01caac4 CTR: 007088ec
REGS: c0005596f220 TRAP: 0700   Not tainted  (5.13.0-rc6+)
MSR:  80029033   CR: 48828222  XER: 0009
CFAR: c01ea698 IRQMASK: 0
GPR00: c01caac4 c0005596f4c0 c1c4a400 0036
GPR04: fffd c0005596f1d0 0027 c018cfd07f90
GPR08: 0023 0001 0027 c018fe68ffe8
GPR12: 8000 c0001e9d1880 c0013a047200 0800
GPR16: c1d3c7d0 0240 0048 c00010aacd18
GPR20: 0001 c00010aacc18 c0013a047c00 c00139ec2400
GPR24: 0280 c00139ec2520 c00136c1b400 c1c93060
GPR28: c0013a047c20 c1d3c6c0 c1c978a0 000d
NIP [c01caac8] build_sched_domains+0xd48/0x1720
LR [c01caac4] build_sched_domains+0xd44/0x1720
Call Trace:
[c0005596f4c0] [c01caac4] build_sched_domains+0xd44/0x1720 
(unreliable)
[c0005596f670] [c01cc5ec] partition_sched_domains_locked+0x3ac/0x4b0
[c0005596f710] [c02804e4] rebuild_sched_domains_locked+0x404/0x9e0
[c0005596f810] [c0283e60] rebuild_sched_domains+0x40/0x70
[c0005596f840] [c0284124] cpuset_hotplug_workfn+0x294/0xf10
[c0005596fc60] [c0175040] process_one_work+0x290/0x590
[c0005596fd00] [c01753c8] worker_thread+0x88/0x620
[c0005596fda0] [c0181704] kthread+0x194/0x1a0
[c0005596fe10] [c000ccec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
485af049 6000 2fa30800 409e0028 80fe e89a00f8 e86100e8 38da0120
7f88e378 7ce53b78 4801fb91 6000 <0fe0> 3900 38e0 38c0

This was because cpu_cpu_mask() was not getting updated on CPU
online/offline but only on CPU add/remove. Other cpumasks get updated
both on CPU online/offline and on add/remove, so update cpu_cpu_mask()
on CPU online/offline too.
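
For reference, cpu_cpu_mask() resolves through the NUMA node map, which
is why it goes stale unless the node cpumask is refreshed on CPU
online/offline as well (a sketch of the current include/linux/topology.h
definition):

	static inline const struct cpumask *cpu_cpu_mask(int cpu)
	{
		return cpumask_of_node(cpu_to_node(cpu));
	}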

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 

Srikar Dronamraju (2):
  powerpc/numa: Print debug statements only when required
  powerpc/numa: Update cpu_cpu_map on CPU online/offline

 arch/powerpc/include/asm/topology.h | 12 
 arch/powerpc/kernel/smp.c   |  3 +++
 arch/powerpc/mm/numa.c  | 18 +++---
 3 files changed, 22 insertions(+), 11 deletions(-)

-- 
2.27.0



[PATCH v2 00/32] powerpc: Add MSI IRQ domains to PCI drivers

2021-07-01 Thread Cédric Le Goater
Hello,

This series adds support for MSI IRQ domains on top of the XICS (P8)
and XIVE (P9/P10) IRQ domains for the PowerNV (baremetal) and pSeries
(VM) platforms. It should simplify and improve IRQ affinity of PCI
MSIs under these PowerPC platforms, especially for drivers distributing
multiple RX/TX queues on the different CPUs of the system.

Data locality can still be improved with an interrupt controller node
per chip but this requires FW changes. It could be done under OPAL.

The patchset has a large impact but it is well contained under the MSI
support. Initial tests were done on the P8, P9 and P10 PowerNV and
pSeries platforms, under the KVM and PowerVM hypervisor. PCI passthrough
was tested on P8/KVM, P9/KVM and P9/pVM with both interrupt modes.

P8 passthrough has some optimization to EOI MSIs when under real mode:

 e3c13e56a471 ("KVM: PPC: Book3S HV: Handle passthrough interrupts in guest")
 5d375199ea96 ("KVM: PPC: Book3S HV: Set server for passed-through interrupts")

They give us a ~10% bandwidth improvement on some 100G adapters
(Thanks Alexey), so it's good to keep but they require access to the
low level IRQ domain of the machine. It should be possible to rework
the code and use the MSI IRQ domains instead but for now, it's simpler
to keep the bypass. That can come later.

The P8/CAPI driver is also impacted. Tests were done on a Firestone
system with a memory AFU.

Thanks,

C.

Changes since v2:

 - Included some CONFIG_IRQ_DOMAIN_HIERARCHY ifdefs
 - Microwatt fixes for ICS native
 - Removed irqd_is_started() check when setting the affinity

Cédric Le Goater (32):
  powerpc/pseries/pci: Introduce __find_pe_total_msi()
  powerpc/pseries/pci: Introduce rtas_prepare_msi_irqs()
  powerpc/xive: Add support for IRQ domain hierarchy
  powerpc/xive: Ease debugging of xive_irq_set_affinity()
  powerpc/pseries/pci: Add MSI domains
  powerpc/xive: Drop unmask of MSIs at startup
  powerpc/xive: Remove irqd_is_started() check when setting the affinity
  powerpc/pseries/pci: Add a domain_free_irqs() handler
  powerpc/pseries/pci: Add a msi_free() handler to clear XIVE data
  powerpc/pseries/pci: Add support of MSI domains to PHB hotplug
  powerpc/powernv/pci: Introduce __pnv_pci_ioda_msi_setup()
  powerpc/powernv/pci: Add MSI domains
  KVM: PPC: Book3S HV: Use the new IRQ chip to detect passthrough
interrupts
  KVM: PPC: Book3S HV: XIVE: Change interface of passthrough interrupt
routines
  KVM: PPC: Book3S HV: XIVE: Fix mapping of passthrough interrupts
  powerpc/xics: Remove ICS list
  powerpc/xics: Rename the map handler in a check handler
  powerpc/xics: Give a name to the default XICS IRQ domain
  powerpc/xics: Add debug logging to the set_irq_affinity handlers
  powerpc/xics: Add support for IRQ domain hierarchy
  powerpc/powernv/pci: Customize the MSI EOI handler to support PHB3
  powerpc/pci: Drop XIVE restriction on MSI domains
  powerpc/xics: Drop unmask of MSIs at startup
  powerpc/pseries/pci: Drop unused MSI code
  powerpc/powernv/pci: Drop unused MSI code
  powerpc/powernv/pci: Adapt is_pnv_opal_msi() to detect passthrough
interrupt
  powerpc/xics: Fix IRQ migration
  powerpc/powernv/pci: Set the IRQ chip data for P8/CXL devices
  powerpc/powernv/pci: Rework pnv_opal_pci_msi_eoi()
  KVM: PPC: Book3S HV: XICS: Fix mapping of passthrough interrupts
  powerpc/xive: Use XIVE domain under xmon and debugfs
  genirq: Improve "hwirq" output in /proc and /sys/

 arch/powerpc/include/asm/kvm_ppc.h |   4 +-
 arch/powerpc/include/asm/pci-bridge.h  |   5 +
 arch/powerpc/include/asm/pnv-pci.h |   2 +-
 arch/powerpc/include/asm/xics.h|   3 +-
 arch/powerpc/include/asm/xive.h|   1 +
 arch/powerpc/platforms/powernv/pci.h   |   6 -
 arch/powerpc/platforms/pseries/pseries.h   |   2 +
 arch/powerpc/kernel/pci-common.c   |   6 +
 arch/powerpc/kvm/book3s_hv.c   |  18 +-
 arch/powerpc/kvm/book3s_hv_rm_xics.c   |   8 +-
 arch/powerpc/kvm/book3s_xive.c |  18 +-
 arch/powerpc/platforms/powernv/pci-ioda.c  | 256 --
 arch/powerpc/platforms/powernv/pci.c   |  67 -
 arch/powerpc/platforms/pseries/msi.c   | 296 -
 arch/powerpc/platforms/pseries/pci_dlpar.c |   4 +
 arch/powerpc/platforms/pseries/setup.c |   2 +
 arch/powerpc/sysdev/xics/ics-native.c  |  13 +-
 arch/powerpc/sysdev/xics/ics-opal.c|  40 +--
 arch/powerpc/sysdev/xics/ics-rtas.c|  40 +--
 arch/powerpc/sysdev/xics/xics-common.c | 129 ++---
 arch/powerpc/sysdev/xive/common.c  |  98 +--
 kernel/irq/irqdesc.c   |   2 +-
 kernel/irq/irqdomain.c |   1 +
 kernel/irq/proc.c  |   2 +-
 24 files changed, 710 insertions(+), 313 deletions(-)

-- 
2.31.1



[PATCH v2 05/32] powerpc/pseries/pci: Add MSI domains

2021-07-01 Thread Cédric Le Goater
Two IRQ domains are added on top of default machine IRQ domain.

First, the top level "pSeries-PCI-MSI" domain deals with the MSI
specificities. In this domain, the HW IRQ numbers are generated by the
PCI MSI layer, they compose a unique ID for an MSI source with the PCI
device identifier and the MSI vector number.

These numbers can be quite large on a pSeries machine running under
the IBM Hypervisor and /sys/kernel/irq/ and /proc/interrupts will
require small fixes to show them correctly.
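
For reference, the generic PCI/MSI layer composes such HW IRQ numbers
roughly as follows (a sketch modelled on pci_msi_domain_calc_hwirq() in
the generic MSI code; the field widths are the generic ones, not
something this patch defines):

	static irq_hw_number_t pci_msi_calc_hwirq(struct pci_dev *dev,
						  struct msi_desc *desc)
	{
		return (irq_hw_number_t)desc->msi_attrib.entry_nr |
			pci_dev_id(dev) << 11 |
			(pci_domain_nr(dev->bus) & 0xFFFFFFFF) << 27;
	}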

Second domain is the in-the-middle "pSeries-MSI" domain which acts as
a proxy between the PCI MSI subsystem and the machine IRQ subsystem.
It usually allocates the MSI vector numbers but, on pSeries machines,
this is done by the RTAS FW and RTAS returns IRQ numbers in the IRQ
number space of the machine. This is why the in-the-middle "pSeries-MSI"
domain has the same HW IRQ numbers as its parent domain.

Only the XIVE (P9/P10) parent domain is supported for now. We still
need to add support for IRQ domain hierarchy under XICS.
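
The resulting stack looks roughly like this (top to bottom):

	"pSeries-PCI-MSI"  PCI/MSI specifics, hwirq composed from
	       |           device id + vector number
	"pSeries-MSI"      proxy, hwirq identical to its parent's
	       |           (vectors allocated by the RTAS FW)
	XIVE               machine IRQ domain (P9/P10 only, for now)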

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/include/asm/pci-bridge.h|   5 +
 arch/powerpc/platforms/pseries/pseries.h |   1 +
 arch/powerpc/kernel/pci-common.c |   6 +
 arch/powerpc/platforms/pseries/msi.c | 185 +++
 arch/powerpc/platforms/pseries/setup.c   |   2 +
 5 files changed, 199 insertions(+)

diff --git a/arch/powerpc/include/asm/pci-bridge.h 
b/arch/powerpc/include/asm/pci-bridge.h
index 74424c14515c..90f488fa4c17 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -126,6 +126,11 @@ struct pci_controller {
 #endif /* CONFIG_PPC64 */
 
void *private_data;
+
+   /* IRQ domain hierarchy */
+   struct irq_domain   *dev_domain;
+   struct irq_domain   *msi_domain;
+   struct fwnode_handle*fwnode;
 };
 
 /* These are used for config access before all the PCI probing
diff --git a/arch/powerpc/platforms/pseries/pseries.h 
b/arch/powerpc/platforms/pseries/pseries.h
index 1f051a786fb3..d9280262588b 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -85,6 +85,7 @@ struct pci_host_bridge;
 int pseries_root_bridge_prepare(struct pci_host_bridge *bridge);
 
 extern struct pci_controller_ops pseries_pci_controller_ops;
+int pseries_msi_allocate_domains(struct pci_controller *phb);
 
 unsigned long pseries_memory_block_size(void);
 
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 001e90cd8948..c3573430919d 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1060,11 +1061,16 @@ void pcibios_bus_add_device(struct pci_dev *dev)
 
 int pcibios_add_device(struct pci_dev *dev)
 {
+   struct irq_domain *d;
+
 #ifdef CONFIG_PCI_IOV
if (ppc_md.pcibios_fixup_sriov)
ppc_md.pcibios_fixup_sriov(dev);
 #endif /* CONFIG_PCI_IOV */
 
+   d = dev_get_msi_domain(&dev->bus->dev);
+   if (d)
+   dev_set_msi_domain(&dev->dev, d);
return 0;
 }
 
diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index 4bf14f27e1aa..86c6809ebac2 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "pseries.h"
 
@@ -518,6 +519,190 @@ static int rtas_setup_msi_irqs(struct pci_dev *pdev, int 
nvec_in, int type)
return 0;
 }
 
+static int pseries_msi_ops_prepare(struct irq_domain *domain, struct device 
*dev,
+  int nvec, msi_alloc_info_t *arg)
+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+   struct msi_desc *desc = first_pci_msi_entry(pdev);
+   int type = desc->msi_attrib.is_msix ? PCI_CAP_ID_MSIX : PCI_CAP_ID_MSI;
+
+   return rtas_prepare_msi_irqs(pdev, nvec, type, arg);
+}
+
+static struct msi_domain_ops pseries_pci_msi_domain_ops = {
+   .msi_prepare= pseries_msi_ops_prepare,
+};
+
+static void pseries_msi_shutdown(struct irq_data *d)
+{
+   d = d->parent_data;
+   if (d->chip->irq_shutdown)
+   d->chip->irq_shutdown(d);
+}
+
+static void pseries_msi_mask(struct irq_data *d)
+{
+   pci_msi_mask_irq(d);
+   irq_chip_mask_parent(d);
+}
+
+static void pseries_msi_unmask(struct irq_data *d)
+{
+   pci_msi_unmask_irq(d);
+   irq_chip_unmask_parent(d);
+}
+
+static struct irq_chip pseries_pci_msi_irq_chip = {
+   .name   = "pSeries-PCI-MSI",
+   .irq_shutdown   = pseries_msi_shutdown,
+   .irq_mask   = pseries_msi_mask,
+   .irq_unmask = pseries_msi_unmask,
+   .irq_eoi= irq_chip_eoi_parent,
+};
+
+static struct msi_domain_info pseries_msi_domain_info = {
+   .flags = (MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
+ MSI_FLAG_MULTI_PCI_MSI  | 

[PATCH v2 04/32] powerpc/xive: Ease debugging of xive_irq_set_affinity()

2021-07-01 Thread Cédric Le Goater
pr_debug() is easier to activate and it helps to know how the kernel
configures the HW when tweaking the IRQ subsystem.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 834f1a378fc2..2c907a4a2b05 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -723,7 +723,7 @@ static int xive_irq_set_affinity(struct irq_data *d,
u32 target, old_target;
int rc = 0;
 
-   pr_devel("xive_irq_set_affinity: irq %d\n", d->irq);
+   pr_debug("%s: irq %d/%x\n", __func__, d->irq, hw_irq);
 
/* Is this valid ? */
if (cpumask_any_and(cpumask, cpu_online_mask) >= nr_cpu_ids)
@@ -768,7 +768,7 @@ static int xive_irq_set_affinity(struct irq_data *d,
return rc;
}
 
-   pr_devel("  target: 0x%x\n", target);
+   pr_debug("  target: 0x%x\n", target);
xd->target = target;
 
/* Give up previous target */
-- 
2.31.1



[PATCH v2 09/32] powerpc/pseries/pci: Add a msi_free() handler to clear XIVE data

2021-07-01 Thread Cédric Le Goater
The MSI domain clears the IRQ with msi_domain_free(), which calls
irq_domain_free_irqs_top(), which clears the handler data. This is a
problem for the XIVE controller since we need to unmap MMIO pages and
free a specific XIVE structure.

The 'msi_free()' handler is called before irq_domain_free_irqs_top()
when the handler data is still available. Use that to clear the XIVE
controller data.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/include/asm/xive.h  |  1 +
 arch/powerpc/platforms/pseries/msi.c | 16 +++-
 arch/powerpc/sysdev/xive/common.c|  5 -
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index aa094a8655b0..20ae50ab083c 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -111,6 +111,7 @@ void xive_native_free_vp_block(u32 vp_base);
 int xive_native_populate_irq_data(u32 hw_irq,
  struct xive_irq_data *data);
 void xive_cleanup_irq_data(struct xive_irq_data *xd);
+void xive_irq_free_data(unsigned int virq);
 void xive_native_free_irq(u32 irq);
 int xive_native_configure_irq(u32 hw_irq, u32 target, u8 prio, u32 sw_irq);
 
diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index 591cee9cbc9e..f9635b01b2bf 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -529,6 +529,19 @@ static int pseries_msi_ops_prepare(struct irq_domain 
*domain, struct device *dev
return rtas_prepare_msi_irqs(pdev, nvec, type, arg);
 }
 
+/*
+ * ->msi_free() is called before irq_domain_free_irqs_top() when the
+ * handler data is still available. Use that to clear the XIVE
+ * controller data.
+ */
+static void pseries_msi_ops_msi_free(struct irq_domain *domain,
+struct msi_domain_info *info,
+unsigned int irq)
+{
+   if (xive_enabled())
+   xive_irq_free_data(irq);
+}
+
 /*
  * RTAS can not disable one MSI at a time. It's all or nothing. Do it
  * at the end after all IRQs have been freed.
@@ -546,6 +559,7 @@ static void pseries_msi_domain_free_irqs(struct irq_domain 
*domain,
 
 static struct msi_domain_ops pseries_pci_msi_domain_ops = {
.msi_prepare= pseries_msi_ops_prepare,
+   .msi_free   = pseries_msi_ops_msi_free,
.domain_free_irqs = pseries_msi_domain_free_irqs,
 };
 
@@ -660,7 +674,7 @@ static void pseries_irq_domain_free(struct irq_domain 
*domain, unsigned int virq
 
pr_debug("%s bridge %pOF %d #%d\n", __func__, phb->dn, virq, nr_irqs);
 
-   irq_domain_free_irqs_parent(domain, virq, nr_irqs);
+   /* XIVE domain data is cleared through ->msi_free() */
 }
 
 static const struct irq_domain_ops pseries_irq_domain_ops = {
diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 38183c9b21c0..f0012d6b4fe9 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -986,6 +986,8 @@ EXPORT_SYMBOL_GPL(is_xive_irq);
 
 void xive_cleanup_irq_data(struct xive_irq_data *xd)
 {
+   pr_debug("%s for HW %x\n", __func__, xd->hw_irq);
+
if (xd->eoi_mmio) {
iounmap(xd->eoi_mmio);
if (xd->eoi_mmio == xd->trig_mmio)
@@ -1027,7 +1029,7 @@ static int xive_irq_alloc_data(unsigned int virq, 
irq_hw_number_t hw)
return 0;
 }
 
-static void xive_irq_free_data(unsigned int virq)
+void xive_irq_free_data(unsigned int virq)
 {
struct xive_irq_data *xd = irq_get_handler_data(virq);
 
@@ -1037,6 +1039,7 @@ static void xive_irq_free_data(unsigned int virq)
xive_cleanup_irq_data(xd);
kfree(xd);
 }
+EXPORT_SYMBOL_GPL(xive_irq_free_data);
 
 #ifdef CONFIG_SMP
 
-- 
2.31.1



[PATCH v2 29/32] powerpc/powernv/pci: Rework pnv_opal_pci_msi_eoi()

2021-07-01 Thread Cédric Le Goater
pnv_opal_pci_msi_eoi() is called from KVM to EOI passthrough interrupts
when in real mode. Adding the MSI domains broke the hack using the
'ioda.irq_chip' field to deduce the owning PHB. Fix that by using the
IRQ chip data in the MSI domain.

The 'ioda.irq_chip' field is now unused and could be removed from the
pnv_phb struct.

Cc: Alexey Kardashevskiy 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/include/asm/pnv-pci.h|  2 +-
 arch/powerpc/kvm/book3s_hv_rm_xics.c  |  8 
 arch/powerpc/platforms/powernv/pci-ioda.c | 17 +
 3 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/pnv-pci.h 
b/arch/powerpc/include/asm/pnv-pci.h
index d0ee0ede5767..b3f480799352 100644
--- a/arch/powerpc/include/asm/pnv-pci.h
+++ b/arch/powerpc/include/asm/pnv-pci.h
@@ -33,7 +33,7 @@ int pnv_cxl_alloc_hwirqs(struct pci_dev *dev, int num);
 void pnv_cxl_release_hwirqs(struct pci_dev *dev, int hwirq, int num);
 int pnv_cxl_get_irq_count(struct pci_dev *dev);
 struct device_node *pnv_pci_get_phb_node(struct pci_dev *dev);
-int64_t pnv_opal_pci_msi_eoi(struct irq_chip *chip, unsigned int hw_irq);
+int64_t pnv_opal_pci_msi_eoi(struct irq_data *d);
 bool is_pnv_opal_msi(struct irq_chip *chip);
 
 #ifdef CONFIG_CXL_BASE
diff --git a/arch/powerpc/kvm/book3s_hv_rm_xics.c 
b/arch/powerpc/kvm/book3s_hv_rm_xics.c
index 0a11ec88a0ae..587c33fc4564 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_xics.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_xics.c
@@ -706,6 +706,7 @@ static int ics_rm_eoi(struct kvm_vcpu *vcpu, u32 irq)
icp->rm_eoied_irq = irq;
}
 
+   /* Handle passthrough interrupts */
if (state->host_irq) {
++vcpu->stat.pthru_all;
if (state->intr_cpu != -1) {
@@ -759,12 +760,12 @@ int xics_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long 
xirr)
 
 static unsigned long eoi_rc;
 
-static void icp_eoi(struct irq_chip *c, u32 hwirq, __be32 xirr, bool *again)
+static void icp_eoi(struct irq_data *d, u32 hwirq, __be32 xirr, bool *again)
 {
void __iomem *xics_phys;
int64_t rc;
 
-   rc = pnv_opal_pci_msi_eoi(c, hwirq);
+   rc = pnv_opal_pci_msi_eoi(d);
 
if (rc)
eoi_rc = rc;
@@ -872,8 +873,7 @@ long kvmppc_deliver_irq_passthru(struct kvm_vcpu *vcpu,
icp_rm_deliver_irq(xics, icp, irq, false);
 
/* EOI the interrupt */
-   icp_eoi(irq_desc_get_chip(irq_map->desc), irq_map->r_hwirq, xirr,
-   again);
+   icp_eoi(irq_desc_get_irq_data(irq_map->desc), irq_map->r_hwirq, xirr, 
again);
 
if (check_too_hard(xics, icp) == H_TOO_HARD)
return 2;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index aa97245eedbf..2389cd79c3c8 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1963,12 +1963,21 @@ void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
pe->dma_setup_done = true;
 }
 
-int64_t pnv_opal_pci_msi_eoi(struct irq_chip *chip, unsigned int hw_irq)
+/*
+ * Called from KVM in real mode to EOI passthru interrupts. The ICP
+ * EOI is handled directly in KVM in kvmppc_deliver_irq_passthru().
+ *
+ * The IRQ data is mapped in the PCI-MSI domain and the EOI OPAL call
+ * needs an HW IRQ number mapped in the XICS IRQ domain. The HW IRQ
+ * numbers of the in-the-middle MSI domain are vector numbers and it's
+ * good enough for OPAL. Use that.
+ */
+int64_t pnv_opal_pci_msi_eoi(struct irq_data *d)
 {
-   struct pnv_phb *phb = container_of(chip, struct pnv_phb,
-  ioda.irq_chip);
+   struct pci_controller *hose = 
irq_data_get_irq_chip_data(d->parent_data);
+   struct pnv_phb *phb = hose->private_data;
 
-   return opal_pci_msi_eoi(phb->opal_id, hw_irq);
+   return opal_pci_msi_eoi(phb->opal_id, d->parent_data->hwirq);
 }
 
 /*
-- 
2.31.1



[PATCH v2 27/32] powerpc/xics: Fix IRQ migration

2021-07-01 Thread Cédric Le Goater
desc->irq_data points to the top level IRQ data descriptor which is
not necessarily in the XICS IRQ domain. MSIs are in another domain for
instance. Fix that by looking for a mapping on the low level XICS IRQ
domain.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xics/xics-common.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/sysdev/xics/xics-common.c 
b/arch/powerpc/sysdev/xics/xics-common.c
index e82d0d4ddec0..0b8b49446992 100644
--- a/arch/powerpc/sysdev/xics/xics-common.c
+++ b/arch/powerpc/sysdev/xics/xics-common.c
@@ -183,6 +183,8 @@ void xics_migrate_irqs_away(void)
unsigned int irq, virq;
struct irq_desc *desc;
 
+   pr_debug("%s: CPU %u\n", __func__, cpu);
+
/* If we used to be the default server, move to the new "boot_cpuid" */
if (hw_cpu == xics_default_server)
xics_update_irq_servers();
@@ -197,6 +199,7 @@ void xics_migrate_irqs_away(void)
struct irq_chip *chip;
long server;
unsigned long flags;
+   struct irq_data *irqd;
 
/* We can't set affinity on ISA interrupts */
if (virq < NUM_ISA_INTERRUPTS)
@@ -204,9 +207,11 @@ void xics_migrate_irqs_away(void)
/* We only need to migrate enabled IRQS */
if (!desc->action)
continue;
-   if (desc->irq_data.domain != xics_host)
+   /* We need a mapping in the XICS IRQ domain */
+   irqd = irq_domain_get_irq_data(xics_host, virq);
+   if (!irqd)
continue;
-   irq = desc->irq_data.hwirq;
+   irq = irqd_to_hwirq(irqd);
/* We need to get IPIs still. */
if (irq == XICS_IPI || irq == XICS_IRQ_SPURIOUS)
continue;
-- 
2.31.1



[PATCH v2 25/32] powerpc/powernv/pci: Drop unused MSI code

2021-07-01 Thread Cédric Le Goater
MSIs should be fully managed by the PCI and IRQ subsystems now.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/powernv/pci.h  |  6 --
 arch/powerpc/platforms/powernv/pci-ioda.c | 27 -
 arch/powerpc/platforms/powernv/pci.c  | 67 ---
 3 files changed, 100 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci.h 
b/arch/powerpc/platforms/powernv/pci.h
index c8d4f222a86f..966a9eb64339 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -123,11 +123,7 @@ struct pnv_phb {
 #endif
 
unsigned intmsi_base;
-   unsigned intmsi32_support;
struct msi_bitmap   msi_bmp;
-   int (*msi_setup)(struct pnv_phb *phb, struct pci_dev *dev,
-unsigned int hwirq, unsigned int virq,
-unsigned int is_64, struct msi_msg *msg);
int (*init_m64)(struct pnv_phb *phb);
int (*get_pe_state)(struct pnv_phb *phb, int pe_no);
void (*freeze_pe)(struct pnv_phb *phb, int pe_no);
@@ -289,8 +285,6 @@ extern void pnv_pci_init_npu2_opencapi_phb(struct 
device_node *np);
 extern void pnv_pci_reset_secondary_bus(struct pci_dev *dev);
 extern int pnv_eeh_phb_reset(struct pci_controller *hose, int option);
 
-extern int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type);
-extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
 extern struct pnv_ioda_pe *pnv_pci_bdfn_to_pe(struct pnv_phb *phb, u16 bdfn);
 extern struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev);
 extern void pnv_set_msi_irq_chip(struct pnv_phb *phb, unsigned int virq);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index e2454439e574..eb38ce1fd434 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2080,29 +2080,6 @@ static int __pnv_pci_ioda_msi_setup(struct pnv_phb *phb, 
struct pci_dev *dev,
return 0;
 }
 
-static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
- unsigned int hwirq, unsigned int virq,
- unsigned int is_64, struct msi_msg *msg)
-{
-   struct pnv_ioda_pe *pe = pnv_ioda_get_pe(dev);
-   unsigned int xive_num = hwirq - phb->msi_base;
-   int rc;
-
-   rc = __pnv_pci_ioda_msi_setup(phb, dev, xive_num, is_64, msg);
-   if (rc)
-   return rc;
-
-   /* P8 only */
-   pnv_set_msi_irq_chip(phb, virq);
-
-   pr_devel("%s: %s-bit MSI on hwirq %x (xive #%d),"
-" address=%x_%08x data=%x PE# %x\n",
-pci_name(dev), is_64 ? "64" : "32", hwirq, xive_num,
-msg->address_hi, msg->address_lo, msg->data, pe->pe_number);
-
-   return 0;
-}
-
 /*
  * The msi_free() op is called before irq_domain_free_irqs_top() when
  * the handler data is still available. Use that to clear the XIVE
@@ -2327,8 +2304,6 @@ static void pnv_pci_init_ioda_msis(struct pnv_phb *phb)
return;
}
 
-   phb->msi_setup = pnv_pci_ioda_msi_setup;
-   phb->msi32_support = 1;
pr_info("  Allocated bitmap for %d MSIs (base IRQ 0x%x)\n",
count, phb->msi_base);
 
@@ -2936,8 +2911,6 @@ static const struct pci_controller_ops 
pnv_pci_ioda_controller_ops = {
.dma_dev_setup  = pnv_pci_ioda_dma_dev_setup,
.dma_bus_setup  = pnv_pci_ioda_dma_bus_setup,
.iommu_bypass_supported = pnv_pci_ioda_iommu_bypass_supported,
-   .setup_msi_irqs = pnv_setup_msi_irqs,
-   .teardown_msi_irqs  = pnv_teardown_msi_irqs,
.enable_device_hook = pnv_pci_enable_device_hook,
.release_device = pnv_pci_release_device,
.window_alignment   = pnv_pci_window_alignment,
diff --git a/arch/powerpc/platforms/powernv/pci.c 
b/arch/powerpc/platforms/powernv/pci.c
index b18468dc31ff..e9dee50ea881 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -160,73 +160,6 @@ int pnv_pci_set_power_state(uint64_t id, uint8_t state, 
struct opal_msg *msg)
 }
 EXPORT_SYMBOL_GPL(pnv_pci_set_power_state);
 
-int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
-{
-   struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
-   struct msi_desc *entry;
-   struct msi_msg msg;
-   int hwirq;
-   unsigned int virq;
-   int rc;
-
-   if (WARN_ON(!phb) || !phb->msi_bmp.bitmap)
-   return -ENODEV;
-
-   if (pdev->no_64bit_msi && !phb->msi32_support)
-   return -ENODEV;
-
-   for_each_pci_msi_entry(entry, pdev) {
-   if (!entry->msi_attrib.is_64 && !phb->msi32_support) {
-   pr_warn("%s: Supports only 64-bit MSIs\n",
-   pci_name(pdev));
-   return -ENXIO;
-   }
-   hwirq = 

Re: [PATCH] sched: Use WARN_ON

2021-07-01 Thread Arnd Bergmann
On Thu, Jul 1, 2021 at 2:57 PM Christophe Leroy
 wrote:
> Le 01/07/2021 à 14:50, Jason Wang a écrit :
> > The BUG_ON macro simplifies an if condition followed by BUG(), but it
> > will lead to the kernel crashing. Therefore, we can try using WARN_ON
> > instead of an if condition followed by BUG().
>
> But are you sure it is ok to continue if spu_acquire(ctx) returned false?
> Shouldn't there be at least some fallback handling?
>
> Something like:
>
> if (WARN_ON(spu_acquire(ctx)))
> return;

I think you get a crash in either case:

- with the existing BUG_ON() there is an immediate backtrace and it stops there
- with WARN_ON() and continuing, you operate on a context that is not
  valid
- with the 'return', you get an endless loop, as it keeps calling
  spusched_tick() without sleeping.

Out of those options, the existing BUG_ON() seems best.
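
The three behaviours, side by side (sketch):

	if (spu_acquire(ctx))
		BUG();			/* 1: immediate backtrace, stops there */

	WARN_ON(spu_acquire(ctx));	/* 2: warn, then operate on an
					   unacquired context */

	if (WARN_ON(spu_acquire(ctx)))
		return;			/* 3: warn and bail; the caller keeps
					   calling spusched_tick() without
					   sleeping */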

   Arnd


Re: [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload

2021-07-01 Thread Nicholas Piggin
Excerpts from Nicholas Piggin's message of July 1, 2021 5:37 pm:
> Excerpts from Christopher M. Riedl's message of July 1, 2021 4:53 pm:
>> On Thu Jul 1, 2021 at 1:04 AM CDT, Nicholas Piggin wrote:
>>> Excerpts from Christopher M. Riedl's message of July 1, 2021 3:28 pm:
>>> > On Wed Jun 30, 2021 at 11:15 PM CDT, Nicholas Piggin wrote:
>>> >> Excerpts from Christopher M. Riedl's message of July 1, 2021 1:48 pm:
>>> >> > On Sun Jun 20, 2021 at 10:13 PM CDT, Daniel Axtens wrote:
>>> >> >> "Christopher M. Riedl"  writes:
>>> >> >>
>>> >> >> > Switching to a different mm with Hash translation causes SLB 
>>> >> >> > entries to
>>> >> >> > be preloaded from the current thread_info. This reduces SLB faults, 
>>> >> >> > for
>>> >> >> > example when threads share a common mm but operate on different 
>>> >> >> > address
>>> >> >> > ranges.
>>> >> >> >
>>> >> >> > Preloading entries from the thread_info struct may not always be
>>> >> >> > appropriate - such as when switching to a temporary mm. Introduce a 
>>> >> >> > new
>>> >> >> > boolean in mm_context_t to skip the SLB preload entirely. Also move 
>>> >> >> > the
>>> >> >> > SLB preload code into a separate function since switch_slb() is 
>>> >> >> > already
>>> >> >> > quite long. The default behavior (preloading SLB entries from the
>>> >> >> > current thread_info struct) remains unchanged.
>>> >> >> >
>>> >> >> > Signed-off-by: Christopher M. Riedl 
>>> >> >> >
>>> >> >> > ---
>>> >> >> >
>>> >> >> > v4:  * New to series.
>>> >> >> > ---
>>> >> >> >  arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
>>> >> >> >  arch/powerpc/include/asm/mmu_context.h   | 13 ++
>>> >> >> >  arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
>>> >> >> >  arch/powerpc/mm/book3s64/slb.c   | 56 
>>> >> >> > ++--
>>> >> >> >  4 files changed, 50 insertions(+), 24 deletions(-)
>>> >> >> >
>>> >> >> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
>>> >> >> > b/arch/powerpc/include/asm/book3s/64/mmu.h
>>> >> >> > index eace8c3f7b0a1..b23a9dcdee5af 100644
>>> >> >> > --- a/arch/powerpc/include/asm/book3s/64/mmu.h
>>> >> >> > +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
>>> >> >> > @@ -130,6 +130,9 @@ typedef struct {
>>> >> >> > u32 pkey_allocation_map;
>>> >> >> > s16 execute_only_pkey; /* key holding execute-only protection */
>>> >> >> >  #endif
>>> >> >> > +
>>> >> >> > +   /* Do not preload SLB entries from thread_info during 
>>> >> >> > switch_slb() */
>>> >> >> > +   bool skip_slb_preload;
>>> >> >> >  } mm_context_t;
>>> >> >> >  
>>> >> >> >  static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
>>> >> >> > diff --git a/arch/powerpc/include/asm/mmu_context.h 
>>> >> >> > b/arch/powerpc/include/asm/mmu_context.h
>>> >> >> > index 4bc45d3ed8b0e..264787e90b1a1 100644
>>> >> >> > --- a/arch/powerpc/include/asm/mmu_context.h
>>> >> >> > +++ b/arch/powerpc/include/asm/mmu_context.h
>>> >> >> > @@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct 
>>> >> >> > mm_struct *oldmm,
>>> >> >> > return 0;
>>> >> >> >  }
>>> >> >> >  
>>> >> >> > +#ifdef CONFIG_PPC_BOOK3S_64
>>> >> >> > +
>>> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm)
>>> >> >> > +{
>>> >> >> > +   mm->context.skip_slb_preload = true;
>>> >> >> > +}
>>> >> >> > +
>>> >> >> > +#else
>>> >> >> > +
>>> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
>>> >> >> > +
>>> >> >> > +#endif /* CONFIG_PPC_BOOK3S_64 */
>>> >> >> > +
>>> >> >> >  #include 
>>> >> >> >  
>>> >> >> >  #endif /* __KERNEL__ */
>>> >> >> > diff --git a/arch/powerpc/mm/book3s64/mmu_context.c 
>>> >> >> > b/arch/powerpc/mm/book3s64/mmu_context.c
>>> >> >> > index c10fc8a72fb37..3479910264c59 100644
>>> >> >> > --- a/arch/powerpc/mm/book3s64/mmu_context.c
>>> >> >> > +++ b/arch/powerpc/mm/book3s64/mmu_context.c
>>> >> >> > @@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, 
>>> >> >> > struct mm_struct *mm)
>>> >> >> > atomic_set(&mm->context.active_cpus, 0);
>>> >> >> > atomic_set(&mm->context.copros, 0);
>>> >> >> >  
>>> >> >> > +   mm->context.skip_slb_preload = false;
>>> >> >> > +
>>> >> >> > return 0;
>>> >> >> >  }
>>> >> >> >  
>>> >> >> > diff --git a/arch/powerpc/mm/book3s64/slb.c 
>>> >> >> > b/arch/powerpc/mm/book3s64/slb.c
>>> >> >> > index c91bd85eb90e3..da0836cb855af 100644
>>> >> >> > --- a/arch/powerpc/mm/book3s64/slb.c
>>> >> >> > +++ b/arch/powerpc/mm/book3s64/slb.c
>>> >> >> > @@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int 
>>> >> >> > index)
>>> >> >> > asm volatile("slbie %0" : : "r" (slbie_data));
>>> >> >> >  }
>>> >> >> >  
>>> >> >> > +static void preload_slb_entries(struct task_struct *tsk, struct 
>>> >> >> > mm_struct *mm)
>>> >> >> Should this be explicitly inline or even __always_inline? I'm thinking
>>> >> >> switch_slb is probably a fairly hot path on hash?
>>> >> > 
>>> >> > Yes absolutely. I'll make this change in v5.
>>> >> > 
>>> >> >>
>>> >> >> > +{
>>> >> >> > +   

[PATCH] powerpc: Only build restart_table.c for 64s

2021-07-01 Thread Michael Ellerman
Commit 9b69d48c7516 ("powerpc/64e: remove implicit soft-masking and
interrupt exit restart logic") limited the implicit soft masking and
restart logic to 64-bit Book3S only. However we are still building
restart_table.c for all 64-bit, ie. Book3E also.

There's no need to build it for 64e, and it also causes missing
prototype warnings for 64e builds, because the prototype is already
behind an #ifdef PPC_BOOK3S_64.

Fixes: 9b69d48c7516 ("powerpc/64e: remove implicit soft-masking and interrupt 
exit restart logic")
Reported-by: kernel test robot 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/lib/Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index 4c92c80454f3..99a7c9132422 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -39,10 +39,10 @@ extra-$(CONFIG_PPC64)   += crtsavres.o
 endif
 
 obj-$(CONFIG_PPC_BOOK3S_64) += copyuser_power7.o copypage_power7.o \
-  memcpy_power7.o
+  memcpy_power7.o restart_table.o
 
 obj64-y+= copypage_64.o copyuser_64.o mem_64.o hweight_64.o \
-  memcpy_64.o copy_mc_64.o restart_table.o
+  memcpy_64.o copy_mc_64.o
 
 ifndef CONFIG_PPC_QUEUED_SPINLOCKS
 obj64-$(CONFIG_SMP)+= locks.o
-- 
2.25.1



Re: [PATCH] sched: Use WARN_ON

2021-07-01 Thread Christophe Leroy




Le 01/07/2021 à 14:50, Jason Wang a écrit :

The BUG_ON macro simplifies an if condition followed by BUG(), but it
will lead to the kernel crashing. Therefore, we can try using WARN_ON
instead of an if condition followed by BUG().


But are you sure it is ok to continue if spu_acquire(ctx) returned false?
Shouldn't there be at least some fallback handling?

Something like:

if (WARN_ON(spu_acquire(ctx)))
return;


Christophe




Signed-off-by: Jason Wang 
---
  arch/powerpc/platforms/cell/spufs/sched.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c 
b/arch/powerpc/platforms/cell/spufs/sched.c
index 369206489895..0f218d9e5733 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -904,8 +904,8 @@ static noinline void spusched_tick(struct spu_context *ctx)
struct spu_context *new = NULL;
struct spu *spu = NULL;
  
-	if (spu_acquire(ctx))

-   BUG();  /* a kernel thread never has signals pending */
+   /* a kernel thread never has signals pending */
+   WARN_ON(spu_acquire(ctx));
  
  	if (ctx->state != SPU_STATE_RUNNABLE)

goto out;



Re: [RFC PATCH 10/43] powerpc/64s: Always set PMU control registers to frozen/disabled when not in use

2021-07-01 Thread Madhavan Srinivasan



On 6/22/21 4:27 PM, Nicholas Piggin wrote:

KVM PMU management code looks for particular frozen/disabled bits in
the PMU registers so it knows whether it must clear them when coming
out of a guest or not. Setting this up helps KVM make these optimisations
without getting confused. Longer term the better approach might be to
move guest/host PMU switching to the perf subsystem.

Signed-off-by: Nicholas Piggin 
---
  arch/powerpc/kernel/cpu_setup_power.c | 4 ++--
  arch/powerpc/kernel/dt_cpu_ftrs.c | 6 +++---
  arch/powerpc/kvm/book3s_hv.c  | 5 +
  arch/powerpc/perf/core-book3s.c   | 7 +++
  4 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/cpu_setup_power.c 
b/arch/powerpc/kernel/cpu_setup_power.c
index a29dc8326622..3dc61e203f37 100644
--- a/arch/powerpc/kernel/cpu_setup_power.c
+++ b/arch/powerpc/kernel/cpu_setup_power.c
@@ -109,7 +109,7 @@ static void init_PMU_HV_ISA207(void)
  static void init_PMU(void)
  {
mtspr(SPRN_MMCRA, 0);
-   mtspr(SPRN_MMCR0, 0);
+   mtspr(SPRN_MMCR0, MMCR0_FC);


Sticky point here is, currently if not frozen, pmc5/6 will
keep counting. And not freezing them at boot is quite useful
sometimes, like say when running in a simulation where we could calculate
approximate CPIs for micro benchmarks without the perf subsystem.
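
For instance, a quick CPI estimate only works while PMC5/6 free-run (a
sketch; PMC5 counts instructions completed, PMC6 cycles, and with
MMCR0_FC set at boot the freeze would first have to be lifted):

	mtspr(SPRN_MMCR0, mfspr(SPRN_MMCR0) & ~MMCR0_FC);
	u64 i0 = mfspr(SPRN_PMC5), c0 = mfspr(SPRN_PMC6);
	/* ... run the micro benchmark ... */
	u64 insns  = mfspr(SPRN_PMC5) - i0;
	u64 cycles = mfspr(SPRN_PMC6) - c0;
	/* CPI ~= cycles / insns */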


mtspr(SPRN_MMCR1, 0);
mtspr(SPRN_MMCR2, 0);
  }
@@ -123,7 +123,7 @@ static void init_PMU_ISA31(void)
  {
mtspr(SPRN_MMCR3, 0);
mtspr(SPRN_MMCRA, MMCRA_BHRB_DISABLE);
-   mtspr(SPRN_MMCR0, MMCR0_PMCCEXT);
+   mtspr(SPRN_MMCR0, MMCR0_FC | MMCR0_PMCCEXT);
  }

  /*
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 0a6b36b4bda8..06a089fbeaa7 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -353,7 +353,7 @@ static void init_pmu_power8(void)
}

mtspr(SPRN_MMCRA, 0);
-   mtspr(SPRN_MMCR0, 0);
+   mtspr(SPRN_MMCR0, MMCR0_FC);
mtspr(SPRN_MMCR1, 0);
mtspr(SPRN_MMCR2, 0);
mtspr(SPRN_MMCRS, 0);
@@ -392,7 +392,7 @@ static void init_pmu_power9(void)
mtspr(SPRN_MMCRC, 0);

mtspr(SPRN_MMCRA, 0);
-   mtspr(SPRN_MMCR0, 0);
+   mtspr(SPRN_MMCR0, MMCR0_FC);
mtspr(SPRN_MMCR1, 0);
mtspr(SPRN_MMCR2, 0);
  }
@@ -428,7 +428,7 @@ static void init_pmu_power10(void)

mtspr(SPRN_MMCR3, 0);
mtspr(SPRN_MMCRA, MMCRA_BHRB_DISABLE);
-   mtspr(SPRN_MMCR0, MMCR0_PMCCEXT);
+   mtspr(SPRN_MMCR0, MMCR0_FC | MMCR0_PMCCEXT);
  }

  static int __init feat_enable_pmu_power10(struct dt_cpu_feature *f)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 1f30f98b09d1..f7349d150828 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2593,6 +2593,11 @@ static int kvmppc_core_vcpu_create_hv(struct kvm_vcpu 
*vcpu)
  #endif
  #endif
vcpu->arch.mmcr[0] = MMCR0_FC;
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   vcpu->arch.mmcr[0] |= MMCR0_PMCCEXT;
+   vcpu->arch.mmcra = MMCRA_BHRB_DISABLE;
+   }
+
vcpu->arch.ctrl = CTRL_RUNLATCH;
/* default to host PVR, since we can't spoof it */
kvmppc_set_pvr_hv(vcpu, mfspr(SPRN_PVR));
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 51622411a7cc..e33b29ec1a65 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1361,6 +1361,13 @@ static void power_pmu_enable(struct pmu *pmu)
goto out;

if (cpuhw->n_events == 0) {
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   mtspr(SPRN_MMCRA, MMCRA_BHRB_DISABLE);
+   mtspr(SPRN_MMCR0, MMCR0_FC | MMCR0_PMCCEXT);
+   } else {
+   mtspr(SPRN_MMCRA, 0);
+   mtspr(SPRN_MMCR0, MMCR0_FC);
+   }
ppc_set_pmu_inuse(0);
goto out;
}
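
To illustrate the point above: with PMC5 (instructions completed) and
PMC6 (cycles) left free-running rather than frozen by MMCR0_FC, a rough
CPI estimate needs nothing from the perf subsystem. A minimal sketch,
not from the patch; the helper name is made up:

	static void report_approx_cpi(void)
	{
		unsigned long insns  = mfspr(SPRN_PMC5);
		unsigned long cycles = mfspr(SPRN_PMC6);

		if (insns)
			pr_info("approx CPI: %lu.%02lu\n", cycles / insns,
				(cycles % insns) * 100 / insns);
	}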


[PATCH v2 01/32] powerpc/pseries/pci: Introduce __find_pe_total_msi()

2021-07-01 Thread Cédric Le Goater
It will help to size the PCI MSI domain.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/pseries/msi.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index 637300330507..d2d090e04745 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -164,12 +164,12 @@ static int check_req_msix(struct pci_dev *pdev, int nvec)
 
 /* Quota calculation */
 
-static struct device_node *find_pe_total_msi(struct pci_dev *dev, int *total)
+static struct device_node *__find_pe_total_msi(struct device_node *node, int *total)
 {
struct device_node *dn;
const __be32 *p;
 
-   dn = of_node_get(pci_device_to_OF_node(dev));
+   dn = of_node_get(node);
while (dn) {
p = of_get_property(dn, "ibm,pe-total-#msi", NULL);
if (p) {
@@ -185,6 +185,11 @@ static struct device_node *find_pe_total_msi(struct pci_dev *dev, int *total)
return NULL;
 }
 
+static struct device_node *find_pe_total_msi(struct pci_dev *dev, int *total)
+{
+   return __find_pe_total_msi(pci_device_to_OF_node(dev), total);
+}
+
 static struct device_node *find_pe_dn(struct pci_dev *dev, int *total)
 {
struct device_node *dn;
-- 
2.31.1
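
A hypothetical caller (not part of this patch) showing what the split
enables: sizing a per-PHB MSI domain straight from the PHB device node.
phb_msi_count() is an illustrative name:

	static int phb_msi_count(struct pci_controller *phb)
	{
		struct device_node *dn;
		int total = 0;

		dn = __find_pe_total_msi(phb->dn, &total);
		if (!dn)
			return -ENODEV;

		/* balance the of_node_get() done by __find_pe_total_msi() */
		of_node_put(dn);
		return total;
	}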



[PATCH v2 11/32] powerpc/powernv/pci: Introduce __pnv_pci_ioda_msi_setup()

2021-07-01 Thread Cédric Le Goater
It will be used as a 'compose_msg' handler of the MSI domain introduced
later.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 28 +++
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 7de464679292..2922674cc934 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2016,15 +2016,17 @@ bool is_pnv_opal_msi(struct irq_chip *chip)
 }
 EXPORT_SYMBOL_GPL(is_pnv_opal_msi);
 
-static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
- unsigned int hwirq, unsigned int virq,
- unsigned int is_64, struct msi_msg *msg)
+static int __pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
+   unsigned int xive_num,
+   unsigned int is_64, struct msi_msg *msg)
 {
struct pnv_ioda_pe *pe = pnv_ioda_get_pe(dev);
-   unsigned int xive_num = hwirq - phb->msi_base;
__be32 data;
int rc;
 
+   dev_dbg(&dev->dev, "%s: setup %s-bit MSI for vector #%d\n", __func__,
+   is_64 ? "64" : "32", xive_num);
+
/* No PE assigned ? bail out ... no MSI for you ! */
if (pe == NULL)
return -ENXIO;
@@ -2072,12 +2074,28 @@ static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, 
struct pci_dev *dev,
}
msg->data = be32_to_cpu(data);
 
+   return 0;
+}
+
+static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
+ unsigned int hwirq, unsigned int virq,
+ unsigned int is_64, struct msi_msg *msg)
+{
+   struct pnv_ioda_pe *pe = pnv_ioda_get_pe(dev);
+   unsigned int xive_num = hwirq - phb->msi_base;
+   int rc;
+
+   rc = __pnv_pci_ioda_msi_setup(phb, dev, xive_num, is_64, msg);
+   if (rc)
+   return rc;
+
+   /* P8 only */
pnv_set_msi_irq_chip(phb, virq);
 
pr_devel("%s: %s-bit MSI on hwirq %x (xive #%d),"
 " address=%x_%08x data=%x PE# %x\n",
 pci_name(dev), is_64 ? "64" : "32", hwirq, xive_num,
-msg->address_hi, msg->address_lo, data, pe->pe_number);
+msg->address_hi, msg->address_lo, msg->data, pe->pe_number);
 
return 0;
 }
-- 
2.31.1



[PATCH v2 12/32] powerpc/powernv/pci: Add MSI domains

2021-07-01 Thread Cédric Le Goater
This is very similar to the MSI domains of the pSeries platform. The
MSI allocator is directly handled under the Linux PHB in the
in-the-middle "PNV-MSI" domain.

Only the XIVE (P9/P10) parent domain is supported for now. Support for
XICS will come later.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 188 ++
 1 file changed, 188 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 2922674cc934..d2a17fcb6002 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -2100,6 +2101,189 @@ static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, 
struct pci_dev *dev,
return 0;
 }
 
+/*
+ * The msi_free() op is called before irq_domain_free_irqs_top() when
+ * the handler data is still available. Use that to clear the XIVE
+ * controller.
+ */
+static void pnv_msi_ops_msi_free(struct irq_domain *domain,
+struct msi_domain_info *info,
+unsigned int irq)
+{
+   if (xive_enabled())
+   xive_irq_free_data(irq);
+}
+
+static struct msi_domain_ops pnv_pci_msi_domain_ops = {
+   .msi_free   = pnv_msi_ops_msi_free,
+};
+
+static void pnv_msi_shutdown(struct irq_data *d)
+{
+   d = d->parent_data;
+   if (d->chip->irq_shutdown)
+   d->chip->irq_shutdown(d);
+}
+
+static void pnv_msi_mask(struct irq_data *d)
+{
+   pci_msi_mask_irq(d);
+   irq_chip_mask_parent(d);
+}
+
+static void pnv_msi_unmask(struct irq_data *d)
+{
+   pci_msi_unmask_irq(d);
+   irq_chip_unmask_parent(d);
+}
+
+static struct irq_chip pnv_pci_msi_irq_chip = {
+   .name   = "PNV-PCI-MSI",
+   .irq_shutdown   = pnv_msi_shutdown,
+   .irq_mask   = pnv_msi_mask,
+   .irq_unmask = pnv_msi_unmask,
+   .irq_eoi= irq_chip_eoi_parent,
+};
+
+static struct msi_domain_info pnv_msi_domain_info = {
+   .flags = (MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
+ MSI_FLAG_MULTI_PCI_MSI  | MSI_FLAG_PCI_MSIX),
+   .ops   = &pnv_pci_msi_domain_ops,
+   .chip  = &pnv_pci_msi_irq_chip,
+};
+
+static void pnv_msi_compose_msg(struct irq_data *d, struct msi_msg *msg)
+{
+   struct msi_desc *entry = irq_data_get_msi_desc(d);
+   struct pci_dev *pdev = msi_desc_to_pci_dev(entry);
+   struct pci_controller *hose = irq_data_get_irq_chip_data(d);
+   struct pnv_phb *phb = hose->private_data;
+   int rc;
+
+   rc = __pnv_pci_ioda_msi_setup(phb, pdev, d->hwirq,
+ entry->msi_attrib.is_64, msg);
+   if (rc)
+   dev_err(&pdev->dev, "Failed to setup %s-bit MSI #%ld : %d\n",
+   entry->msi_attrib.is_64 ? "64" : "32", d->hwirq, rc);
+}
+
+static struct irq_chip pnv_msi_irq_chip = {
+   .name   = "PNV-MSI",
+   .irq_shutdown   = pnv_msi_shutdown,
+   .irq_mask   = irq_chip_mask_parent,
+   .irq_unmask = irq_chip_unmask_parent,
+   .irq_eoi= irq_chip_eoi_parent,
+   .irq_set_affinity   = irq_chip_set_affinity_parent,
+   .irq_compose_msi_msg= pnv_msi_compose_msg,
+};
+
+static int pnv_irq_parent_domain_alloc(struct irq_domain *domain,
+  unsigned int virq, int hwirq)
+{
+   struct irq_fwspec parent_fwspec;
+   int ret;
+
+   parent_fwspec.fwnode = domain->parent->fwnode;
+   parent_fwspec.param_count = 2;
+   parent_fwspec.param[0] = hwirq;
+   parent_fwspec.param[1] = IRQ_TYPE_EDGE_RISING;
+
+   ret = irq_domain_alloc_irqs_parent(domain, virq, 1, &parent_fwspec);
+   if (ret)
+   return ret;
+
+   return 0;
+}
+
+static int pnv_irq_domain_alloc(struct irq_domain *domain, unsigned int virq,
+   unsigned int nr_irqs, void *arg)
+{
+   struct pci_controller *hose = domain->host_data;
+   struct pnv_phb *phb = hose->private_data;
+   msi_alloc_info_t *info = arg;
+   struct pci_dev *pdev = msi_desc_to_pci_dev(info->desc);
+   int hwirq;
+   int i, ret;
+
+   hwirq = msi_bitmap_alloc_hwirqs(&phb->msi_bmp, nr_irqs);
+   if (hwirq < 0) {
+   dev_warn(&pdev->dev, "failed to find a free MSI\n");
+   return -ENOSPC;
+   }
+
+   dev_dbg(&pdev->dev, "%s bridge %pOF %d/%x #%d\n", __func__,
+   hose->dn, virq, hwirq, nr_irqs);
+
+   for (i = 0; i < nr_irqs; i++) {
+   ret = pnv_irq_parent_domain_alloc(domain, virq + i,
+ phb->msi_base + hwirq + i);
+   if (ret)
+   goto out;
+
+   irq_domain_set_hwirq_and_chip(domain, virq + i, hwirq + i,
+				      &pnv_msi_irq_chip, hose);
+ 
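
The hunk above is truncated in the archive. By symmetry with the alloc
path, the matching free handler would return the hwirqs to the PHB
bitmap and release the parent interrupts -- a sketch, not the author's
verbatim code:

	static void pnv_irq_domain_free(struct irq_domain *domain,
					unsigned int virq, unsigned int nr_irqs)
	{
		struct irq_data *d = irq_domain_get_irq_data(domain, virq);
		struct pci_controller *hose = irq_data_get_irq_chip_data(d);
		struct pnv_phb *phb = hose->private_data;

		msi_bitmap_free_hwirqs(&phb->msi_bmp, d->hwirq, nr_irqs);
		irq_domain_free_irqs_parent(domain, virq, nr_irqs);
	}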

[PATCH v2 13/32] KVM: PPC: Book3S HV: Use the new IRQ chip to detect passthrough interrupts

2021-07-01 Thread Cédric Le Goater
Passthrough PCI MSI interrupts are detected in KVM with a check on a
specific EOI handler (P8) or on XIVE (P9). We can now check the
PCI-MSI IRQ chip which is cleaner.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/kvm/book3s_hv.c  | 2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index dd39b5373075..048b4ca55cfe 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5260,7 +5260,7 @@ static int kvmppc_set_passthru_irq(struct kvm *kvm, int 
host_irq, int guest_gsi)
 * what our real-mode EOI code does, or a XIVE interrupt
 */
chip = irq_data_get_irq_chip(&desc->irq_data);
-   if (!chip || !(is_pnv_opal_msi(chip) || is_xive_irq(chip))) {
+   if (!chip || !is_pnv_opal_msi(chip)) {
 		pr_warn("kvmppc_set_passthru_irq_hv: Could not assign IRQ map for (%d,%d)\n",
 			host_irq, guest_gsi);
mutex_unlock(&kvm->lock);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index d2a17fcb6002..e77caa4dbbdf 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2007,13 +2007,15 @@ void pnv_set_msi_irq_chip(struct pnv_phb *phb, unsigned 
int virq)
irq_set_chip(virq, &phb->ioda.irq_chip);
 }
 
+static struct irq_chip pnv_pci_msi_irq_chip;
+
 /*
  * Returns true iff chip is something that we could call
  * pnv_opal_pci_msi_eoi for.
  */
 bool is_pnv_opal_msi(struct irq_chip *chip)
 {
-   return chip->irq_eoi == pnv_ioda2_msi_eoi;
+   return chip->irq_eoi == pnv_ioda2_msi_eoi || chip == &pnv_pci_msi_irq_chip;
 }
 EXPORT_SYMBOL_GPL(is_pnv_opal_msi);
 
-- 
2.31.1



[PATCH v2 08/32] powerpc/pseries/pci: Add a domain_free_irqs() handler

2021-07-01 Thread Cédric Le Goater
The RTAS firmware can not disable one MSI at a time. It's all or
nothing. We need a custom free IRQ handler for that.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/pseries/msi.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index 86c6809ebac2..591cee9cbc9e 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -529,8 +529,24 @@ static int pseries_msi_ops_prepare(struct irq_domain 
*domain, struct device *dev
return rtas_prepare_msi_irqs(pdev, nvec, type, arg);
 }
 
+/*
+ * RTAS can not disable one MSI at a time. It's all or nothing. Do it
+ * at the end after all IRQs have been freed.
+ */
+static void pseries_msi_domain_free_irqs(struct irq_domain *domain,
+struct device *dev)
+{
+   if (WARN_ON_ONCE(!dev_is_pci(dev)))
+   return;
+
+   __msi_domain_free_irqs(domain, dev);
+
+   rtas_disable_msi(to_pci_dev(dev));
+}
+
 static struct msi_domain_ops pseries_pci_msi_domain_ops = {
.msi_prepare= pseries_msi_ops_prepare,
+   .domain_free_irqs = pseries_msi_domain_free_irqs,
 };
 
 static void pseries_msi_shutdown(struct irq_data *d)
-- 
2.31.1
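
For context, rtas_disable_msi() folds the "all or nothing" teardown
into a single firmware call -- roughly (a sketch, not the verbatim
helper from msi.c):

	static int rtas_disable_msi(struct pci_dev *pdev)
	{
		struct pci_dn *pdn = pci_get_pdn(pdev);

		if (!pdn)
			return -ENODEV;

		/* "changing" to zero MSIs disables them all at once */
		return rtas_change_msi(pdn, RTAS_CHANGE_FN, 0) ? -EIO : 0;
	}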



[PATCH v2 17/32] powerpc/xics: Rename the map handler in a check handler

2021-07-01 Thread Cédric Le Goater
This moves the IRQ initialization done under the different ICS backends
into the common part of XICS. The 'map' handler becomes a simple
'check' on the HW IRQ at the FW level.

As we don't need an ICS anymore in xics_migrate_irqs_away(), the XICS
domain does not set a chip data for the IRQ.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/include/asm/xics.h|  3 ++-
 arch/powerpc/sysdev/xics/ics-native.c  | 13 +---
 arch/powerpc/sysdev/xics/ics-opal.c| 27 +
 arch/powerpc/sysdev/xics/ics-rtas.c| 28 +-
 arch/powerpc/sysdev/xics/xics-common.c | 15 --
 5 files changed, 36 insertions(+), 50 deletions(-)

diff --git a/arch/powerpc/include/asm/xics.h b/arch/powerpc/include/asm/xics.h
index 584dcf903590..e76d835dc03f 100644
--- a/arch/powerpc/include/asm/xics.h
+++ b/arch/powerpc/include/asm/xics.h
@@ -89,10 +89,11 @@ static inline int ics_opal_init(void) { return -ENODEV; }
 /* ICS instance, hooked up to chip_data of an irq */
 struct ics {
struct list_head link;
-   int (*map)(struct ics *ics, unsigned int virq);
+   int (*check)(struct ics *ics, unsigned int hwirq);
void (*mask_unknown)(struct ics *ics, unsigned long vec);
long (*get_server)(struct ics *ics, unsigned long vec);
int (*host_match)(struct ics *ics, struct device_node *node);
+   struct irq_chip *chip;
char data[];
 };
 
diff --git a/arch/powerpc/sysdev/xics/ics-native.c 
b/arch/powerpc/sysdev/xics/ics-native.c
index d450502f4053..dec7d93a8ba1 100644
--- a/arch/powerpc/sysdev/xics/ics-native.c
+++ b/arch/powerpc/sysdev/xics/ics-native.c
@@ -131,19 +131,15 @@ static struct irq_chip ics_native_irq_chip = {
.irq_retrigger  = xics_retrigger,
 };
 
-static int ics_native_map(struct ics *ics, unsigned int virq)
+static int ics_native_check(struct ics *ics, unsigned int hw_irq)
 {
-   unsigned int vec = (unsigned int)virq_to_hw(virq);
struct ics_native *in = to_ics_native(ics);
 
-   pr_devel("%s: vec=0x%x\n", __func__, vec);
+   pr_devel("%s: hw_irq=0x%x\n", __func__, hw_irq);
 
-   if (vec < in->ibase || vec >= (in->ibase + in->icount))
+   if (hw_irq < in->ibase || hw_irq >= (in->ibase + in->icount))
return -EINVAL;
 
-   irq_set_chip_and_handler(virq, &ics_native_irq_chip, handle_fasteoi_irq);
-   irq_set_chip_data(virq, ics);
-
return 0;
 }
 
@@ -177,10 +173,11 @@ static int ics_native_host_match(struct ics *ics, struct 
device_node *node)
 }
 
 static struct ics ics_native_template = {
-   .map= ics_native_map,
+   .check  = ics_native_check,
.mask_unknown   = ics_native_mask_unknown,
.get_server = ics_native_get_server,
.host_match = ics_native_host_match,
+   .chip = &ics_native_irq_chip,
 };
 
 static int __init ics_native_add_one(struct device_node *np)
diff --git a/arch/powerpc/sysdev/xics/ics-opal.c 
b/arch/powerpc/sysdev/xics/ics-opal.c
index 823f6c9664cd..8c7ddcc718b6 100644
--- a/arch/powerpc/sysdev/xics/ics-opal.c
+++ b/arch/powerpc/sysdev/xics/ics-opal.c
@@ -157,26 +157,13 @@ static struct irq_chip ics_opal_irq_chip = {
.irq_retrigger = xics_retrigger,
 };
 
-static int ics_opal_map(struct ics *ics, unsigned int virq);
-static void ics_opal_mask_unknown(struct ics *ics, unsigned long vec);
-static long ics_opal_get_server(struct ics *ics, unsigned long vec);
-
 static int ics_opal_host_match(struct ics *ics, struct device_node *node)
 {
return 1;
 }
 
-/* Only one global & state struct ics */
-static struct ics ics_hal = {
-   .map= ics_opal_map,
-   .mask_unknown   = ics_opal_mask_unknown,
-   .get_server = ics_opal_get_server,
-   .host_match = ics_opal_host_match,
-};
-
-static int ics_opal_map(struct ics *ics, unsigned int virq)
+static int ics_opal_check(struct ics *ics, unsigned int hw_irq)
 {
-   unsigned int hw_irq = (unsigned int)virq_to_hw(virq);
int64_t rc;
__be16 server;
int8_t priority;
@@ -189,9 +176,6 @@ static int ics_opal_map(struct ics *ics, unsigned int virq)
if (rc != OPAL_SUCCESS)
return -ENXIO;
 
-   irq_set_chip_and_handler(virq, &ics_opal_irq_chip, handle_fasteoi_irq);
-   irq_set_chip_data(virq, &ics_hal);
-
return 0;
 }
 
@@ -222,6 +206,15 @@ static long ics_opal_get_server(struct ics *ics, unsigned 
long vec)
return ics_opal_unmangle_server(be16_to_cpu(server));
 }
 
+/* Only one global & state struct ics */
+static struct ics ics_hal = {
+   .check  = ics_opal_check,
+   .mask_unknown   = ics_opal_mask_unknown,
+   .get_server = ics_opal_get_server,
+   .host_match = ics_opal_host_match,
+   .chip   = &ics_opal_irq_chip,
+};
+
 int __init ics_opal_init(void)
 {
if (!firmware_has_feature(FW_FEATURE_OPAL))
diff --git a/arch/powerpc/sysdev/xics/ics-rtas.c 
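
The ics-rtas.c hunk is cut off in the archive. By symmetry with the
ics-native and ics-opal changes above, it would rename the 'map'
handler to a 'check' and publish the chip -- a sketch, not the verbatim
hunk:

	static struct ics ics_rtas = {
		.check		= ics_rtas_check,
		.mask_unknown	= ics_rtas_mask_unknown,
		.get_server	= ics_rtas_get_server,
		.host_match	= ics_rtas_host_match,
		.chip		= &ics_rtas_irq_chip,
	};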

[PATCH v2 20/32] powerpc/xics: Add support for IRQ domain hierarchy

2021-07-01 Thread Cédric Le Goater
XICS doesn't have any state associated with the IRQ. The support is
straightforward and simpler than for XIVE.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xics/xics-common.c | 41 ++
 1 file changed, 41 insertions(+)

diff --git a/arch/powerpc/sysdev/xics/xics-common.c 
b/arch/powerpc/sysdev/xics/xics-common.c
index 419d91bffec3..e82d0d4ddec0 100644
--- a/arch/powerpc/sysdev/xics/xics-common.c
+++ b/arch/powerpc/sysdev/xics/xics-common.c
@@ -406,7 +406,48 @@ int xics_retrigger(struct irq_data *data)
return 0;
 }
 
+#ifdef CONFIG_IRQ_DOMAIN_HIERARCHY
+static int xics_host_domain_translate(struct irq_domain *d, struct irq_fwspec 
*fwspec,
+ unsigned long *hwirq, unsigned int *type)
+{
+   return xics_host_xlate(d, to_of_node(fwspec->fwnode), fwspec->param,
+  fwspec->param_count, hwirq, type);
+}
+
+static int xics_host_domain_alloc(struct irq_domain *domain, unsigned int virq,
+ unsigned int nr_irqs, void *arg)
+{
+   struct irq_fwspec *fwspec = arg;
+   irq_hw_number_t hwirq;
+   unsigned int type = IRQ_TYPE_NONE;
+   int i, rc;
+
+   rc = xics_host_domain_translate(domain, fwspec, &hwirq, &type);
+   if (rc)
+   return rc;
+
+   pr_debug("%s %d/%lx #%d\n", __func__, virq, hwirq, nr_irqs);
+
+   for (i = 0; i < nr_irqs; i++)
+   irq_domain_set_info(domain, virq + i, hwirq + i, xics_ics->chip,
+   xics_ics, handle_fasteoi_irq, NULL, NULL);
+
+   return 0;
+}
+
+static void xics_host_domain_free(struct irq_domain *domain,
+ unsigned int virq, unsigned int nr_irqs)
+{
+   pr_debug("%s %d #%d\n", __func__, virq, nr_irqs);
+}
+#endif
+
 static const struct irq_domain_ops xics_host_ops = {
+#ifdef CONFIG_IRQ_DOMAIN_HIERARCHY
+   .alloc  = xics_host_domain_alloc,
+   .free   = xics_host_domain_free,
+   .translate = xics_host_domain_translate,
+#endif
.match = xics_host_match,
.map = xics_host_map,
.xlate = xics_host_xlate,
-- 
2.31.1
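
A hypothetical user of the new hierarchy: build an irq_fwspec naming a
XICS hwirq and allocate through the domain. xics_alloc_one() is an
illustrative name, not from the patch:

	static int xics_alloc_one(struct irq_domain *domain, u32 hwirq)
	{
		struct irq_fwspec fwspec = {
			.fwnode      = domain->fwnode,
			.param_count = 2,
			.param       = { hwirq, IRQ_TYPE_EDGE_RISING },
		};

		return irq_domain_alloc_irqs(domain, 1, NUMA_NO_NODE, &fwspec);
	}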



[PATCH v2 19/32] powerpc/xics: Add debug logging to the set_irq_affinity handlers

2021-07-01 Thread Cédric Le Goater
It really helps to know how the HW is configured when tweaking the IRQ
subsystem.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xics/ics-opal.c | 2 +-
 arch/powerpc/sysdev/xics/ics-rtas.c | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/xics/ics-opal.c 
b/arch/powerpc/sysdev/xics/ics-opal.c
index 8c7ddcc718b6..bf26cae1b982 100644
--- a/arch/powerpc/sysdev/xics/ics-opal.c
+++ b/arch/powerpc/sysdev/xics/ics-opal.c
@@ -133,7 +133,7 @@ static int ics_opal_set_affinity(struct irq_data *d,
}
server = ics_opal_mangle_server(wanted_server);
 
-   pr_devel("ics-hal: set-affinity irq %d [hw 0x%x] server: 0x%x/0x%x\n",
+   pr_debug("ics-hal: set-affinity irq %d [hw 0x%x] server: 0x%x/0x%x\n",
 d->irq, hw_irq, wanted_server, server);
 
rc = opal_set_xive(hw_irq, server, priority);
diff --git a/arch/powerpc/sysdev/xics/ics-rtas.c 
b/arch/powerpc/sysdev/xics/ics-rtas.c
index 6d19d711ed35..b50c6341682e 100644
--- a/arch/powerpc/sysdev/xics/ics-rtas.c
+++ b/arch/powerpc/sysdev/xics/ics-rtas.c
@@ -133,6 +133,9 @@ static int ics_rtas_set_affinity(struct irq_data *d,
return -1;
}
 
+   pr_debug("%s: irq %d [hw 0x%x] server: 0x%x\n", __func__, d->irq,
+hw_irq, irq_server);
+
status = rtas_call_reentrant(ibm_set_xive, 3, 1, NULL,
 hw_irq, irq_server, xics_status[1]);
 
-- 
2.31.1



Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes

2021-07-01 Thread Valentin Schneider
On 01/07/21 09:45, Srikar Dronamraju wrote:
> @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
>  void sched_domains_numa_masks_set(unsigned int cpu)
>  {
>   int node = cpu_to_node(cpu);
> - int i, j;
> + int i, j, empty;
>
> + empty = cpumask_empty(sched_domains_numa_masks[0][node]);
>   for (i = 0; i < sched_domains_numa_levels; i++) {
>   for (j = 0; j < nr_node_ids; j++) {
> - if (node_distance(j, node) <= sched_domains_numa_distance[i])
> + if (!node_online(j))
> + continue;
> +
> + if (node_distance(j, node) <= sched_domains_numa_distance[i]) {
>   cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
> +
> + /*
> +  * We skip updating numa_masks for offline
> +  * nodes. However now that the node is
> +  * finally online, CPUs that were added
> +  * earlier should now be accommodated into
> +  * the newly online node's numa mask.
> +  */
> + if (node != j && empty) {
> + cpumask_or(sched_domains_numa_masks[i][node],
> +sched_domains_numa_masks[i][node],
> +sched_domains_numa_masks[0][j]);
> + }
> + }

Hmph, so we're playing games with masks of offline nodes - is that really
necessary? Your modification of sched_init_numa() still scans all of the
nodes (regardless of their online status) to build the distance map, and
that is never updated (sched_init_numa() is pretty much an __init
function).

So AFAICT this is all to cope with topology_span_sane() not applying
'cpu_map' to its masks. That seemed fine to me back when I wrote it, but in
light of having bogus distance values for offline nodes, not so much...

What about the below instead?

---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b77ad49dc14f..c2d9caad4aa6 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2075,6 +2075,7 @@ static struct sched_domain *build_sched_domain(struct 
sched_domain_topology_leve
 static bool topology_span_sane(struct sched_domain_topology_level *tl,
  const struct cpumask *cpu_map, int cpu)
 {
+   struct cpumask *intersect = sched_domains_tmpmask;
int i;
 
/* NUMA levels are allowed to overlap */
@@ -2090,14 +2091,17 @@ static bool topology_span_sane(struct 
sched_domain_topology_level *tl,
for_each_cpu(i, cpu_map) {
if (i == cpu)
continue;
+
/*
-* We should 'and' all those masks with 'cpu_map' to exactly
-* match the topology we're about to build, but that can only
-* remove CPUs, which only lessens our ability to detect
-* overlaps
+* We shouldn't have to bother with cpu_map here, unfortunately
+* some architectures (powerpc says hello) have to deal with
+* offline NUMA nodes reporting bogus distance values. This can
+* lead to funky NODE domain spans, but since those are offline
+* we can mask them out.
 */
+   cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
-   cpumask_intersects(tl->mask(cpu), tl->mask(i)))
+   cpumask_intersects(intersect, cpu_map))
return false;
}
 


[PATCH v2 07/32] powerpc/xive: Remove irqd_is_started() check when setting the affinity

2021-07-01 Thread Cédric Le Goater
In the early days of XIVE support, commit cffb717ceb8e ("powerpc/xive:
Ensure active irqd when setting affinity") tried to fix an issue
related to interrupt migration. If the root cause was related to CPU
unplug, it should have been fixed and there is no reason to keep the
irqd_is_started() check. This test also breaks affinity setting
of MSIs, which can be set before the associated IRQ is started.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index a03057bfccfd..38183c9b21c0 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -719,10 +719,6 @@ static int xive_irq_set_affinity(struct irq_data *d,
if (cpumask_any_and(cpumask, cpu_online_mask) >= nr_cpu_ids)
return -EINVAL;
 
-   /* Don't do anything if the interrupt isn't started */
-   if (!irqd_is_started(d))
-   return IRQ_SET_MASK_OK;
-
/*
 * If existing target is already in the new mask, and is
 * online then do nothing.
-- 
2.31.1



[PATCH] powerpc/mm: Fix lockup on kernel exec fault

2021-07-01 Thread Christophe Leroy
The powerpc kernel is not prepared to handle exec faults from kernel.
Especially, the function is_exec_fault() will return 'false' when an
exec fault is taken by kernel, because the check is based on reading
current->thread.regs->trap which contains the trap from user.

For instance, when provoking a LKDTM EXEC_USERSPACE test,
current->thread.regs->trap is set to SYSCALL trap (0xc00), and
the fault taken by the kernel is not seen as an exec fault by
set_access_flags_filter().

Commit d7df2443cd5f ("powerpc/mm: Fix spurrious segfaults on radix
with autonuma") made it clear and handled it properly. But later on
commit d3ca587404b3 ("powerpc/mm: Fix reporting of kernel execute
faults") removed that handling, introducing test based on error_code.
And here is the problem, because on the 603 all upper bits of SRR1
get cleared when the TLB instruction miss handler bails out to ISI.

Until commit cbd7e6ca0210 ("powerpc/fault: Avoid heavy
search_exception_tables() verification"), an exec fault from kernel
at a userspace address was indirectly caught by the lack of entry for
that address in the exception tables. But after that commit the
kernel mainly relies on KUAP or on core mm handling to catch wrong
user accesses. Here the access is not wrong, so mm handles it.
It is a minor fault because PAGE_EXEC is not set,
set_access_flags_filter() should set PAGE_EXEC and voila.
But as is_exec_fault() returns false as explained in the beginning,
set_access_flags_filter() bails out without setting PAGE_EXEC flag,
which leads to a forever minor exec fault.

As the kernel is not prepared to handle such exec faults, the thing
to do is to fire in bad_kernel_fault() for any exec fault taken by
the kernel, as it was prior to commit d3ca587404b3.

Fixes: d3ca587404b3 ("powerpc/mm: Fix reporting of kernel execute faults")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/fault.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 34f641d4a2fe..a8d0ce85d39a 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -199,9 +199,7 @@ static bool bad_kernel_fault(struct pt_regs *regs, unsigned 
long error_code,
 {
int is_exec = TRAP(regs) == INTERRUPT_INST_STORAGE;
 
-   /* NX faults set DSISR_PROTFAULT on the 8xx, DSISR_NOEXEC_OR_G on 
others */
-   if (is_exec && (error_code & (DSISR_NOEXEC_OR_G | DSISR_KEYFAULT |
- DSISR_PROTFAULT))) {
+   if (is_exec) {
 		pr_crit_ratelimited("kernel tried to execute %s page (%lx) - exploit attempt? (uid: %d)\n",
 				    address >= TASK_SIZE ? "exec-protected" : "user",
address,
-- 
2.25.0



[PATCH] powerpc/xmon: Use ARRAY_SIZE

2021-07-01 Thread Jason Wang
The ARRAY_SIZE macro is a more compact and more formal way to get an
array's size in the Linux kernel source. In addition, it is more
readable for kernel developers. Thus, we can replace all
sizeof(arr)/sizeof(arr[0]) uses with ARRAY_SIZE.

Signed-off-by: Jason Wang 
---
 arch/powerpc/xmon/ppc-opc.c | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/xmon/ppc-opc.c b/arch/powerpc/xmon/ppc-opc.c
index dfb80810b16c..e1d292fe6c6e 100644
--- a/arch/powerpc/xmon/ppc-opc.c
+++ b/arch/powerpc/xmon/ppc-opc.c
@@ -954,8 +954,7 @@ const struct powerpc_operand powerpc_operands[] =
   { 0xff, 11, NULL, NULL, PPC_OPERAND_SIGNOPT },
 };
 
-const unsigned int num_powerpc_operands = (sizeof (powerpc_operands)
-  / sizeof (powerpc_operands[0]));
+const unsigned int num_powerpc_operands = ARRAY_SIZE(powerpc_operands);
 
 /* The functions used to insert and extract complicated operands.  */
 
@@ -6968,9 +6967,8 @@ const struct powerpc_opcode powerpc_opcodes[] = {
 {"fcfidu.",XRC(63,974,1),  XRA_MASK, POWER7|PPCA2, PPCVLE, {FRT, 
FRB}},
 };
 
-const int powerpc_num_opcodes =
-  sizeof (powerpc_opcodes) / sizeof (powerpc_opcodes[0]);
-
+const int powerpc_num_opcodes = ARRAY_SIZE(powerpc_opcodes);
+
 /* The VLE opcode table.
 
The format of this opcode table is the same as the main opcode table.  */
@@ -7207,9 +7205,8 @@ const struct powerpc_opcode vle_opcodes[] = {
 {"se_bl",  BD8(58,0,1),BD8_MASK,   PPCVLE, 0,  {B8}},
 };
 
-const int vle_num_opcodes =
-  sizeof (vle_opcodes) / sizeof (vle_opcodes[0]);
-
+const int vle_num_opcodes = ARRAY_SIZE(vle_opcodes);
+
 /* The macro table.  This is only used by the assembler.  */
 
 /* The expressions of the form (-x ! 31) & (x | 31) have the value 0
@@ -7276,5 +7273,4 @@ const struct powerpc_macro powerpc_macros[] = {
 {"e_clrlslwi",4, PPCVLE, "e_rlwinm %0,%1,%3,(%2)-(%3),31-(%3)"},
 };
 
-const int powerpc_num_macros =
-  sizeof (powerpc_macros) / sizeof (powerpc_macros[0]);
+const int powerpc_num_macros = ARRAY_SIZE(powerpc_macros);
-- 
2.31.1
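
For reference, the kernel's ARRAY_SIZE (include/linux/kernel.h) also
type-checks its argument, so passing a plain pointer fails to build:

	#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + __must_be_array(arr))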





[PATCH] sched: Use WARN_ON

2021-07-01 Thread Jason Wang
The BUG_ON macro simplifies an if condition followed by BUG(), but it
will crash the kernel. Therefore, we can try using WARN_ON instead of
an if condition followed by BUG().

Signed-off-by: Jason Wang 
---
 arch/powerpc/platforms/cell/spufs/sched.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c 
b/arch/powerpc/platforms/cell/spufs/sched.c
index 369206489895..0f218d9e5733 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -904,8 +904,8 @@ static noinline void spusched_tick(struct spu_context *ctx)
struct spu_context *new = NULL;
struct spu *spu = NULL;
 
-   if (spu_acquire(ctx))
-   BUG();  /* a kernel thread never has signals pending */
+   /* a kernel thread never has signals pending */
+   WARN_ON(spu_acquire(ctx));
 
if (ctx->state != SPU_STATE_RUNNABLE)
goto out;
-- 
2.32.0
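
WARN_ON() evaluates to its condition, so the caller can warn and still
bail out, addressing the fallback concern raised in review. A sketch
with a made-up name and a hypothetical -EINTR fallback:

	static int spusched_tick_checked(struct spu_context *ctx)
	{
		if (WARN_ON(spu_acquire(ctx)))
			return -EINTR;	/* skip the tick, don't run unlocked */

		/* ... scheduling work with ctx held ... */

		spu_release(ctx);
		return 0;
	}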





[PATCH v2 26/32] powerpc/powernv/pci: Adapt is_pnv_opal_msi() to detect passthrough interrupt

2021-07-01 Thread Cédric Le Goater
The pnv_ioda2_msi_eoi() chip handler is not used anymore for MSIs.
Simply use the check on the PSI-MSI chip.

Cc: Alexey Kardashevskiy 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index eb38ce1fd434..6c4b37598bcc 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2015,7 +2015,7 @@ static struct irq_chip pnv_pci_msi_irq_chip;
  */
 bool is_pnv_opal_msi(struct irq_chip *chip)
 {
-   return chip->irq_eoi == pnv_ioda2_msi_eoi || chip == &pnv_pci_msi_irq_chip;
+   return chip == &pnv_pci_msi_irq_chip;
 }
 EXPORT_SYMBOL_GPL(is_pnv_opal_msi);
 
-- 
2.31.1



[PATCH v2 14/32] KVM: PPC: Book3S HV: XIVE: Change interface of passthrough interrupt routines

2021-07-01 Thread Cédric Le Goater
The routine kvmppc_set_passthru_irq() calls kvmppc_xive_set_mapped()
and kvmppc_xive_clr_mapped() with an IRQ descriptor. Use directly the
host IRQ number to remove a useless conversion.

Add some debug.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/include/asm/kvm_ppc.h |  4 ++--
 arch/powerpc/kvm/book3s_hv.c   |  4 ++--
 arch/powerpc/kvm/book3s_xive.c | 17 -
 3 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 2d88944f9f34..671fbd1a765e 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -664,9 +664,9 @@ extern int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
struct kvm_vcpu *vcpu, u32 cpu);
 extern void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu);
 extern int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long guest_irq,
- struct irq_desc *host_desc);
+ unsigned long host_irq);
 extern int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long guest_irq,
- struct irq_desc *host_desc);
+ unsigned long host_irq);
 extern u64 kvmppc_xive_get_icp(struct kvm_vcpu *vcpu);
 extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval);
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 048b4ca55cfe..965178aeff13 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5303,7 +5303,7 @@ static int kvmppc_set_passthru_irq(struct kvm *kvm, int 
host_irq, int guest_gsi)
pimap->n_mapped++;
 
if (xics_on_xive())
-   rc = kvmppc_xive_set_mapped(kvm, guest_gsi, desc);
+   rc = kvmppc_xive_set_mapped(kvm, guest_gsi, host_irq);
else
kvmppc_xics_set_mapped(kvm, guest_gsi, desc->irq_data.hwirq);
if (rc)
@@ -5344,7 +5344,7 @@ static int kvmppc_clr_passthru_irq(struct kvm *kvm, int 
host_irq, int guest_gsi)
}
 
if (xics_on_xive())
-   rc = kvmppc_xive_clr_mapped(kvm, guest_gsi, 
pimap->mapped[i].desc);
+   rc = kvmppc_xive_clr_mapped(kvm, guest_gsi, host_irq);
else
kvmppc_xics_clr_mapped(kvm, guest_gsi, 
pimap->mapped[i].r_hwirq);
 
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 9268d386b128..434da541a20b 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -921,13 +921,12 @@ int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval)
 }
 
 int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long guest_irq,
-  struct irq_desc *host_desc)
+  unsigned long host_irq)
 {
struct kvmppc_xive *xive = kvm->arch.xive;
struct kvmppc_xive_src_block *sb;
struct kvmppc_xive_irq_state *state;
-   struct irq_data *host_data = irq_desc_get_irq_data(host_desc);
-   unsigned int host_irq = irq_desc_get_irq(host_desc);
+   struct irq_data *host_data = irq_get_irq_data(host_irq);
unsigned int hw_irq = (unsigned int)irqd_to_hwirq(host_data);
u16 idx;
u8 prio;
@@ -936,7 +935,8 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long 
guest_irq,
if (!xive)
return -ENODEV;
 
-   pr_devel("set_mapped girq 0x%lx host HW irq 0x%x...\n",guest_irq, 
hw_irq);
+   pr_debug("%s: GIRQ 0x%lx host IRQ %ld XIVE HW IRQ 0x%x\n",
+__func__, guest_irq, host_irq, hw_irq);
 
sb = kvmppc_xive_find_source(xive, guest_irq, &idx);
if (!sb)
@@ -958,7 +958,7 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long 
guest_irq,
 */
rc = irq_set_vcpu_affinity(host_irq, state);
if (rc) {
-   pr_err("Failed to set VCPU affinity for irq %d\n", host_irq);
+   pr_err("Failed to set VCPU affinity for host IRQ %ld\n", 
host_irq);
return rc;
}
 
@@ -1018,12 +1018,11 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned 
long guest_irq,
 EXPORT_SYMBOL_GPL(kvmppc_xive_set_mapped);
 
 int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long guest_irq,
-  struct irq_desc *host_desc)
+  unsigned long host_irq)
 {
struct kvmppc_xive *xive = kvm->arch.xive;
struct kvmppc_xive_src_block *sb;
struct kvmppc_xive_irq_state *state;
-   unsigned int host_irq = irq_desc_get_irq(host_desc);
u16 idx;
u8 prio;
int rc;
@@ -1031,7 +1030,7 @@ int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long 
guest_irq,
if (!xive)
return -ENODEV;
 
-   pr_devel("clr_mapped girq 0x%lx...\n", guest_irq);
+   pr_debug("%s: GIRQ 0x%lx host IRQ %ld\n", __func__, guest_irq, 
host_irq);
 
sb = kvmppc_xive_find_source(xive, guest_irq, &idx);

[PATCH v2 23/32] powerpc/xics: Drop unmask of MSIs at startup

2021-07-01 Thread Cédric Le Goater
That was a workaround in the XICS domain because of the lack of MSI
domain. This is now handled.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xics/ics-opal.c | 11 ---
 arch/powerpc/sysdev/xics/ics-rtas.c |  9 -
 2 files changed, 20 deletions(-)

diff --git a/arch/powerpc/sysdev/xics/ics-opal.c 
b/arch/powerpc/sysdev/xics/ics-opal.c
index bf26cae1b982..c4d95d8beb6f 100644
--- a/arch/powerpc/sysdev/xics/ics-opal.c
+++ b/arch/powerpc/sysdev/xics/ics-opal.c
@@ -62,17 +62,6 @@ static void ics_opal_unmask_irq(struct irq_data *d)
 
 static unsigned int ics_opal_startup(struct irq_data *d)
 {
-#ifdef CONFIG_PCI_MSI
-   /*
-* The generic MSI code returns with the interrupt disabled on the
-* card, using the MSI mask bits. Firmware doesn't appear to unmask
-* at that level, so we do it here by hand.
-*/
-   if (irq_data_get_msi_desc(d))
-   pci_msi_unmask_irq(d);
-#endif
-
-   /* unmask it */
ics_opal_unmask_irq(d);
return 0;
 }
diff --git a/arch/powerpc/sysdev/xics/ics-rtas.c 
b/arch/powerpc/sysdev/xics/ics-rtas.c
index b50c6341682e..b9da317b7a2d 100644
--- a/arch/powerpc/sysdev/xics/ics-rtas.c
+++ b/arch/powerpc/sysdev/xics/ics-rtas.c
@@ -57,15 +57,6 @@ static void ics_rtas_unmask_irq(struct irq_data *d)
 
 static unsigned int ics_rtas_startup(struct irq_data *d)
 {
-#ifdef CONFIG_PCI_MSI
-   /*
-* The generic MSI code returns with the interrupt disabled on the
-* card, using the MSI mask bits. Firmware doesn't appear to unmask
-* at that level, so we do it here by hand.
-*/
-   if (irq_data_get_msi_desc(d))
-   pci_msi_unmask_irq(d);
-#endif
/* unmask it */
ics_rtas_unmask_irq(d);
return 0;
-- 
2.31.1



[PATCH v2 31/32] powerpc/xive: Use XIVE domain under xmon and debugfs

2021-07-01 Thread Cédric Le Goater
The default domain of the PCI/MSIs is not the XIVE domain anymore. To
list the IRQ mappings under XMON and debugfs, query the IRQ data from
the low level XIVE domain.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index f0012d6b4fe9..f8ff558bc305 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -322,11 +322,10 @@ void xmon_xive_get_irq_all(void)
struct irq_desc *desc;
 
for_each_irq_desc(i, desc) {
-   struct irq_data *d = irq_desc_get_irq_data(desc);
-   unsigned int hwirq = (unsigned int)irqd_to_hwirq(d);
+   struct irq_data *d = irq_domain_get_irq_data(xive_irq_domain, i);
 
-   if (d->domain == xive_irq_domain)
-   xmon_xive_get_irq_config(hwirq, d);
+   if (d)
+   xmon_xive_get_irq_config(irqd_to_hwirq(d), d);
}
 }
 
@@ -1766,9 +1765,9 @@ static int xive_core_debug_show(struct seq_file *m, void 
*private)
xive_debug_show_cpu(m, cpu);
 
for_each_irq_desc(i, desc) {
-   struct irq_data *d = irq_desc_get_irq_data(desc);
+   struct irq_data *d = irq_domain_get_irq_data(xive_irq_domain, i);
 
-   if (d->domain == xive_irq_domain)
+   if (d)
xive_debug_show_irq(m, d);
}
return 0;
-- 
2.31.1



[PATCH v2 30/32] KVM: PPC: Book3S HV: XICS: Fix mapping of passthrough interrupts

2021-07-01 Thread Cédric Le Goater
PCI MSIs now live in an MSI domain but the underlying calls, which
will EOI the interrupt in real mode, need an HW IRQ number mapped in
the XICS IRQ domain. Grab it there.

Cc: Alexey Kardashevskiy 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/kvm/book3s_hv.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 965178aeff13..1afbe91c6ca1 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5233,6 +5233,7 @@ static int kvmppc_set_passthru_irq(struct kvm *kvm, int 
host_irq, int guest_gsi)
struct kvmppc_passthru_irqmap *pimap;
struct irq_chip *chip;
int i, rc = 0;
+   struct irq_data *host_data;
 
if (!kvm_irq_bypass)
return 1;
@@ -5297,7 +5298,14 @@ static int kvmppc_set_passthru_irq(struct kvm *kvm, int 
host_irq, int guest_gsi)
 * the KVM real mode handler.
 */
smp_wmb();
-   irq_map->r_hwirq = desc->irq_data.hwirq;
+
+   /*
+* The 'host_irq' number is mapped in the PCI-MSI domain but
+* the underlying calls, which will EOI the interrupt in real
+* mode, need an HW IRQ number mapped in the XICS IRQ domain.
+*/
+   host_data = irq_domain_get_irq_data(irq_get_default_host(), host_irq);
+   irq_map->r_hwirq = (unsigned int)irqd_to_hwirq(host_data);
 
if (i == pimap->n_mapped)
pimap->n_mapped++;
@@ -5305,7 +5313,7 @@ static int kvmppc_set_passthru_irq(struct kvm *kvm, int 
host_irq, int guest_gsi)
if (xics_on_xive())
rc = kvmppc_xive_set_mapped(kvm, guest_gsi, host_irq);
else
-   kvmppc_xics_set_mapped(kvm, guest_gsi, desc->irq_data.hwirq);
+   kvmppc_xics_set_mapped(kvm, guest_gsi, irq_map->r_hwirq);
if (rc)
irq_map->r_hwirq = 0;
 
-- 
2.31.1



[PATCH v2 22/32] powerpc/pci: Drop XIVE restriction on MSI domains

2021-07-01 Thread Cédric Le Goater
The PowerNV and pSeries platforms now have support for both the XICS
and XIVE IRQ domains.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 4 +---
 arch/powerpc/platforms/pseries/msi.c  | 4 
 2 files changed, 1 insertion(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index b498876a976f..e2454439e574 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2332,9 +2332,7 @@ static void pnv_pci_init_ioda_msis(struct pnv_phb *phb)
pr_info("  Allocated bitmap for %d MSIs (base IRQ 0x%x)\n",
count, phb->msi_base);
 
-   /* Only supported by the XIVE driver */
-   if (xive_enabled())
-   pnv_msi_allocate_domains(phb->hose, count);
+   pnv_msi_allocate_domains(phb->hose, count);
 }
 
 static void pnv_ioda_setup_pe_res(struct pnv_ioda_pe *pe,
diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index e2127a3f7ebd..e196cc1b8540 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -720,10 +720,6 @@ int pseries_msi_allocate_domains(struct pci_controller 
*phb)
 {
int count;
 
-   /* Only supported by the XIVE driver */
-   if (!xive_enabled())
-   return -ENODEV;
-
if (!__find_pe_total_msi(phb->dn, &count)) {
pr_err("PCI: failed to find MSIs for bridge %pOF (domain %d)\n",
   phb->dn, phb->global_number);
-- 
2.31.1



[PATCH v2 16/32] powerpc/xics: Remove ICS list

2021-07-01 Thread Cédric Le Goater
We always had only one ICS per machine. Simplify the XICS driver by
removing the ICS list.

The ICS stored in the chip data of the XICS domain becomes useless and
we don't need it anymore to migrate away IRQs from a CPU. This will be
removed in a subsequent patch.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xics/xics-common.c | 45 +++---
 1 file changed, 19 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/sysdev/xics/xics-common.c 
b/arch/powerpc/sysdev/xics/xics-common.c
index 7c561a612366..05e5e7d84ca7 100644
--- a/arch/powerpc/sysdev/xics/xics-common.c
+++ b/arch/powerpc/sysdev/xics/xics-common.c
@@ -38,7 +38,7 @@ DEFINE_PER_CPU(struct xics_cppr, xics_cppr);
 
 struct irq_domain *xics_host;
 
-static LIST_HEAD(ics_list);
+static struct ics *xics_ics;
 
 void xics_update_irq_servers(void)
 {
@@ -111,12 +111,11 @@ void xics_setup_cpu(void)
 
 void xics_mask_unknown_vec(unsigned int vec)
 {
-   struct ics *ics;
-
pr_err("Interrupt 0x%x (real) is invalid, disabling it.\n", vec);
 
-   list_for_each_entry(ics, &ics_list, link)
-   ics->mask_unknown(ics, vec);
+   if (WARN_ON(!xics_ics))
+   return;
+   xics_ics->mask_unknown(xics_ics, vec);
 }
 
 
@@ -198,7 +197,6 @@ void xics_migrate_irqs_away(void)
struct irq_chip *chip;
long server;
unsigned long flags;
-   struct ics *ics;
 
/* We can't set affinity on ISA interrupts */
if (virq < NUM_ISA_INTERRUPTS)
@@ -219,13 +217,10 @@ void xics_migrate_irqs_away(void)
raw_spin_lock_irqsave(&desc->lock, flags);
 
/* Locate interrupt server */
-   server = -1;
-   ics = irq_desc_get_chip_data(desc);
-   if (ics)
-   server = ics->get_server(ics, irq);
+   server = xics_ics->get_server(xics_ics, irq);
if (server < 0) {
-   printk(KERN_ERR "%s: Can't find server for irq %d\n",
-  __func__, irq);
+   pr_err("%s: Can't find server for irq %d/%x\n",
+  __func__, virq, irq);
goto unlock;
}
 
@@ -307,13 +302,9 @@ int xics_get_irq_server(unsigned int virq, const struct 
cpumask *cpumask,
 static int xics_host_match(struct irq_domain *h, struct device_node *node,
   enum irq_domain_bus_token bus_token)
 {
-   struct ics *ics;
-
-   list_for_each_entry(ics, &ics_list, link)
-   if (ics->host_match(ics, node))
-   return 1;
-
-   return 0;
+   if (WARN_ON(!xics_ics))
+   return 0;
+   return xics_ics->host_match(xics_ics, node) ? 1 : 0;
 }
 
 /* Dummies */
@@ -330,8 +321,6 @@ static struct irq_chip xics_ipi_chip = {
 static int xics_host_map(struct irq_domain *h, unsigned int virq,
 irq_hw_number_t hw)
 {
-   struct ics *ics;
-
pr_devel("xics: map virq %d, hwirq 0x%lx\n", virq, hw);
 
/*
@@ -348,12 +337,14 @@ static int xics_host_map(struct irq_domain *h, unsigned 
int virq,
return 0;
}
 
+   if (WARN_ON(!xics_ics))
+   return -EINVAL;
+
/* Let the ICS setup the chip data */
-   list_for_each_entry(ics, &ics_list, link)
-   if (ics->map(ics, virq) == 0)
-   return 0;
+   if (xics_ics->map(xics_ics, virq))
+   return -EINVAL;
 
-   return -EINVAL;
+   return 0;
 }
 
 static int xics_host_xlate(struct irq_domain *h, struct device_node *ct,
@@ -427,7 +418,9 @@ static void __init xics_init_host(void)
 
 void __init xics_register_ics(struct ics *ics)
 {
-   list_add(&ics->link, &ics_list);
+   if (WARN_ONCE(xics_ics, "XICS: Source Controller is already defined !"))
+   return;
+   xics_ics = ics;
 }
 
 static void __init xics_get_server_size(void)
-- 
2.31.1



[PATCH v2 02/32] powerpc/pseries/pci: Introduce rtas_prepare_msi_irqs()

2021-07-01 Thread Cédric Le Goater
This splits the routine setting the MSIs in two parts: allocation of
MSIs for the PCI device at the FW level (RTAS) and the actual mapping
and activation of the IRQs.

rtas_prepare_msi_irqs() will serve as a handler for the PCI MSI domain.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/pseries/msi.c | 23 +++
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index d2d090e04745..4bf14f27e1aa 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -373,12 +373,11 @@ static void rtas_hack_32bit_msi_gen2(struct pci_dev *pdev)
pci_write_config_dword(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI, 0);
 }
 
-static int rtas_setup_msi_irqs(struct pci_dev *pdev, int nvec_in, int type)
+static int rtas_prepare_msi_irqs(struct pci_dev *pdev, int nvec_in, int type,
+msi_alloc_info_t *arg)
 {
struct pci_dn *pdn;
-   int hwirq, virq, i, quota, rc;
-   struct msi_desc *entry;
-   struct msi_msg msg;
+   int quota, rc;
int nvec = nvec_in;
int use_32bit_msi_hack = 0;
 
@@ -456,6 +455,22 @@ static int rtas_setup_msi_irqs(struct pci_dev *pdev, int 
nvec_in, int type)
return rc;
}
 
+   return 0;
+}
+
+static int rtas_setup_msi_irqs(struct pci_dev *pdev, int nvec_in, int type)
+{
+   struct pci_dn *pdn;
+   int hwirq, virq, i;
+   int rc;
+   struct msi_desc *entry;
+   struct msi_msg msg;
+
+   rc = rtas_prepare_msi_irqs(pdev, nvec_in, type, NULL);
+   if (rc)
+   return rc;
+
+   pdn = pci_get_pdn(pdev);
i = 0;
for_each_pci_msi_entry(entry, pdev) {
hwirq = rtas_query_irq_number(pdn, i++);
-- 
2.31.1



[PATCH v2 18/32] powerpc/xics: Give a name to the default XICS IRQ domain

2021-07-01 Thread Cédric Le Goater
and clean up the error path.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xics/xics-common.c | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/sysdev/xics/xics-common.c 
b/arch/powerpc/sysdev/xics/xics-common.c
index 399dd5becf65..419d91bffec3 100644
--- a/arch/powerpc/sysdev/xics/xics-common.c
+++ b/arch/powerpc/sysdev/xics/xics-common.c
@@ -412,11 +412,22 @@ static const struct irq_domain_ops xics_host_ops = {
.xlate = xics_host_xlate,
 };
 
-static void __init xics_init_host(void)
+static int __init xics_allocate_domain(void)
 {
-   xics_host = irq_domain_add_tree(NULL, &xics_host_ops, NULL);
-   BUG_ON(xics_host == NULL);
+   struct fwnode_handle *fn;
+
+   fn = irq_domain_alloc_named_fwnode("XICS");
+   if (!fn)
+   return -ENOMEM;
+
+   xics_host = irq_domain_create_tree(fn, &xics_host_ops, NULL);
+   if (!xics_host) {
+   irq_domain_free_fwnode(fn);
+   return -ENOMEM;
+   }
+
irq_set_default_host(xics_host);
+   return 0;
 }
 
 void __init xics_register_ics(struct ics *ics)
@@ -480,6 +491,8 @@ void __init xics_init(void)
/* Initialize common bits */
xics_get_server_size();
xics_update_irq_servers();
-   xics_init_host();
+   rc = xics_allocate_domain();
+   if (rc < 0)
+   pr_err("XICS: Failed to create IRQ domain");
xics_setup_cpu();
 }
-- 
2.31.1



[PATCH v2 28/32] powerpc/powernv/pci: Set the IRQ chip data for P8/CXL devices

2021-07-01 Thread Cédric Le Goater
Before MSI domains, the default IRQ chip of PHB3 MSIs was patched by
pnv_set_msi_irq_chip() with the custom EOI handler pnv_ioda2_msi_eoi()
and the owning PHB was deduced from the 'ioda.irq_chip' field. This
path has been deprecated by the MSI domains but it is still in use by
the P8 CAPI 'cxl' driver.

Rewriting this driver to support MSI would be a waste of time.
Nevertheless, we can still remove the IRQ chip patch and set the IRQ
chip data instead. This is cleaner.

Cc: Frederic Barrat 
Cc: Christophe Lombard 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 6c4b37598bcc..aa97245eedbf 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1971,19 +1971,23 @@ int64_t pnv_opal_pci_msi_eoi(struct irq_chip *chip, 
unsigned int hw_irq)
return opal_pci_msi_eoi(phb->opal_id, hw_irq);
 }
 
+/*
+ * The IRQ data is mapped in the XICS domain, with OPAL HW IRQ numbers
+ */
 static void pnv_ioda2_msi_eoi(struct irq_data *d)
 {
int64_t rc;
unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
-   struct irq_chip *chip = irq_data_get_irq_chip(d);
+   struct pci_controller *hose = irq_data_get_irq_chip_data(d);
+   struct pnv_phb *phb = hose->private_data;
 
-   rc = pnv_opal_pci_msi_eoi(chip, hw_irq);
+   rc = opal_pci_msi_eoi(phb->opal_id, hw_irq);
WARN_ON_ONCE(rc);
 
icp_native_eoi(d);
 }
 
-
+/* P8/CXL only */
 void pnv_set_msi_irq_chip(struct pnv_phb *phb, unsigned int virq)
 {
struct irq_data *idata;
@@ -2005,6 +2009,7 @@ void pnv_set_msi_irq_chip(struct pnv_phb *phb, unsigned 
int virq)
phb->ioda.irq_chip.irq_eoi = pnv_ioda2_msi_eoi;
}
irq_set_chip(virq, &phb->ioda.irq_chip);
+   irq_set_chip_data(virq, phb->hose);
 }
 
 static struct irq_chip pnv_pci_msi_irq_chip;
-- 
2.31.1



[PATCH v2 32/32] genirq: Improve "hwirq" output in /proc and /sys/

2021-07-01 Thread Cédric Le Goater
The HW IRQ numbers generated by the PCI MSI layer can be quite large
on a pSeries machine when running under the IBM Hypervisor and they
appear as negative. Use '%lu' instead to show them correctly.

Cc: Thomas Gleixner 
Signed-off-by: Cédric Le Goater 
---
 kernel/irq/irqdesc.c | 2 +-
 kernel/irq/proc.c| 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 4a617d7312a4..1d8b7fb6b366 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -188,7 +188,7 @@ static ssize_t hwirq_show(struct kobject *kobj,
 
raw_spin_lock_irq(&desc->lock);
if (desc->irq_data.domain)
-   ret = sprintf(buf, "%d\n", (int)desc->irq_data.hwirq);
+   ret = sprintf(buf, "%lu\n", desc->irq_data.hwirq);
raw_spin_unlock_irq(&desc->lock);
 
return ret;
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 7c5cd42df3b9..ee595ec09778 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -513,7 +513,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, " %8s", "None");
}
if (desc->irq_data.domain)
-   seq_printf(p, " %*d", prec, (int) desc->irq_data.hwirq);
+   seq_printf(p, " %*lu", prec, desc->irq_data.hwirq);
else
seq_printf(p, " %*s", prec, "");
 #ifdef CONFIG_GENERIC_IRQ_SHOW_LEVEL
-- 
2.31.1
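
For reference, hwirq numbers are unsigned long (include/linux/types.h),
hence the '%lu' format:

	typedef unsigned long irq_hw_number_t;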



[PATCH v2 15/32] KVM: PPC: Book3S HV: XIVE: Fix mapping of passthrough interrupts

2021-07-01 Thread Cédric Le Goater
PCI MSI interrupt numbers are now mapped in a PCI-MSI domain but the
underlying calls handling the passthrough of the interrupt in the
guest need a number in the XIVE IRQ domain.

Use the IRQ data mapped in the XIVE IRQ domain and not the one in the
PCI-MSI domain.

Cc: Thomas Gleixner 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/kvm/book3s_xive.c | 3 ++-
 kernel/irq/irqdomain.c | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 434da541a20b..d30eb35cc7f0 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -926,7 +926,8 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long 
guest_irq,
struct kvmppc_xive *xive = kvm->arch.xive;
struct kvmppc_xive_src_block *sb;
struct kvmppc_xive_irq_state *state;
-   struct irq_data *host_data = irq_get_irq_data(host_irq);
+   struct irq_data *host_data =
+   irq_domain_get_irq_data(irq_get_default_host(), host_irq);
unsigned int hw_irq = (unsigned int)irqd_to_hwirq(host_data);
u16 idx;
u8 prio;
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index 6284443b87ec..c8c06318dcbf 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -481,6 +481,7 @@ struct irq_domain *irq_get_default_host(void)
 {
return irq_default_domain;
 }
+EXPORT_SYMBOL_GPL(irq_get_default_host);
 
 static void irq_domain_clear_mapping(struct irq_domain *domain,
 irq_hw_number_t hwirq)
-- 
2.31.1



[PATCH v2] sched: Use BUG_ON

2021-07-01 Thread Jason Wang
The BUG_ON macro simplifies an if condition followed by BUG(), so that
we can use BUG_ON instead of an if condition followed by BUG().

Signed-off-by: Jason Wang 
---
 arch/powerpc/platforms/cell/spufs/sched.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c 
b/arch/powerpc/platforms/cell/spufs/sched.c
index 369206489895..0f218d9e5733 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -904,8 +904,8 @@ static noinline void spusched_tick(struct spu_context *ctx)
struct spu_context *new = NULL;
struct spu *spu = NULL;
 
-   if (spu_acquire(ctx))
-   BUG();  /* a kernel thread never has signals pending */
+   /* a kernel thread never has signals pending */
+   BUG_ON(spu_acquire(ctx));
 
if (ctx->state != SPU_STATE_RUNNABLE)
goto out;
-- 
2.32.0





[PATCH v2 24/32] powerpc/pseries/pci: Drop unused MSI code

2021-07-01 Thread Cédric Le Goater
MSIs should be fully managed by the PCI and IRQ subsystems now.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/pseries/msi.c | 87 
 1 file changed, 87 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index e196cc1b8540..1b305e411862 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -111,21 +111,6 @@ static int rtas_query_irq_number(struct pci_dn *pdn, int 
offset)
return rtas_ret[0];
 }
 
-static void rtas_teardown_msi_irqs(struct pci_dev *pdev)
-{
-   struct msi_desc *entry;
-
-   for_each_pci_msi_entry(entry, pdev) {
-   if (!entry->irq)
-   continue;
-
-   irq_set_msi_desc(entry->irq, NULL);
-   irq_dispose_mapping(entry->irq);
-   }
-
-   rtas_disable_msi(pdev);
-}
-
 static int check_req(struct pci_dev *pdev, int nvec, char *prop_name)
 {
struct device_node *dn;
@@ -459,66 +444,6 @@ static int rtas_prepare_msi_irqs(struct pci_dev *pdev, int 
nvec_in, int type,
return 0;
 }
 
-static int rtas_setup_msi_irqs(struct pci_dev *pdev, int nvec_in, int type)
-{
-   struct pci_dn *pdn;
-   int hwirq, virq, i;
-   int rc;
-   struct msi_desc *entry;
-   struct msi_msg msg;
-
-   rc = rtas_prepare_msi_irqs(pdev, nvec_in, type, NULL);
-   if (rc)
-   return rc;
-
-   pdn = pci_get_pdn(pdev);
-   i = 0;
-   for_each_pci_msi_entry(entry, pdev) {
-   hwirq = rtas_query_irq_number(pdn, i++);
-   if (hwirq < 0) {
-   pr_debug("rtas_msi: error (%d) getting hwirq\n", rc);
-   return hwirq;
-   }
-
-   /*
-* Depending on the number of online CPUs in the original
-* kernel, it is likely for CPU #0 to be offline in a kdump
-* kernel. The associated IRQs in the affinity mappings
-* provided by irq_create_affinity_masks() are thus not
-* started by irq_startup(), as per-design for managed IRQs.
-* This can be a problem with multi-queue block devices driven
-* by blk-mq : such a non-started IRQ is very likely paired
-* with the single queue enforced by blk-mq during kdump (see
-* blk_mq_alloc_tag_set()). This causes the device to remain
-* silent and likely hangs the guest at some point.
-*
-* We don't really care for fine-grained affinity when doing
-* kdump actually : simply ignore the pre-computed affinity
-* masks in this case and let the default mask with all CPUs
-* be used when creating the IRQ mappings.
-*/
-   if (is_kdump_kernel())
-   virq = irq_create_mapping(NULL, hwirq);
-   else
-   virq = irq_create_mapping_affinity(NULL, hwirq,
-  entry->affinity);
-
-   if (!virq) {
-   pr_debug("rtas_msi: Failed mapping hwirq %d\n", hwirq);
-   return -ENOSPC;
-   }
-
-   dev_dbg(&pdev->dev, "rtas_msi: allocated virq %d\n", virq);
-   irq_set_msi_desc(virq, entry);
-
-   /* Read config space back so we can restore after reset */
-   __pci_read_msi_msg(entry, &msg);
-   entry->msg = msg;
-   }
-
-   return 0;
-}
-
 static int pseries_msi_ops_prepare(struct irq_domain *domain, struct device 
*dev,
   int nvec, msi_alloc_info_t *arg)
 {
@@ -759,8 +684,6 @@ static void rtas_msi_pci_irq_fixup(struct pci_dev *pdev)
 
 static int rtas_msi_init(void)
 {
-   struct pci_controller *phb;
-
query_token  = rtas_token("ibm,query-interrupt-source-number");
change_token = rtas_token("ibm,change-msi");
 
@@ -772,16 +695,6 @@ static int rtas_msi_init(void)
 
pr_debug("rtas_msi: Registering RTAS MSI callbacks.\n");
 
-   WARN_ON(pseries_pci_controller_ops.setup_msi_irqs);
-   pseries_pci_controller_ops.setup_msi_irqs = rtas_setup_msi_irqs;
-   pseries_pci_controller_ops.teardown_msi_irqs = rtas_teardown_msi_irqs;
-
-   list_for_each_entry(phb, &hose_list, list_node) {
-   WARN_ON(phb->controller_ops.setup_msi_irqs);
-   phb->controller_ops.setup_msi_irqs = rtas_setup_msi_irqs;
-   phb->controller_ops.teardown_msi_irqs = rtas_teardown_msi_irqs;
-   }
-
WARN_ON(ppc_md.pci_irq_fixup);
ppc_md.pci_irq_fixup = rtas_msi_pci_irq_fixup;
 
-- 
2.31.1



[PATCH v2 03/32] powerpc/xive: Add support for IRQ domain hierarchy

2021-07-01 Thread Cédric Le Goater
This adds handlers to allocate/free IRQs in a domain hierarchy. We
could try to use xive_irq_domain_map() in xive_irq_domain_alloc(), but
since we rely on xive_irq_alloc_data() to set the IRQ handler data,
duplicating the code is simpler.

xive_irq_free_data() needs to be called when IRQs are freed to clear
the MMIO mappings and free the XIVE handler data (the xive_irq_data
structure). This is going to be a problem with MSI domains, which we
will address later.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 64 +++
 1 file changed, 64 insertions(+)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index f985ed331a8c..834f1a378fc2 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1375,7 +1375,71 @@ static void xive_irq_domain_debug_show(struct seq_file 
*m, struct irq_domain *d,
 }
 #endif
 
+#ifdef CONFIG_IRQ_DOMAIN_HIERARCHY
+static int xive_irq_domain_translate(struct irq_domain *d,
+struct irq_fwspec *fwspec,
+unsigned long *hwirq,
+unsigned int *type)
+{
+   return xive_irq_domain_xlate(d, to_of_node(fwspec->fwnode),
+fwspec->param, fwspec->param_count,
+hwirq, type);
+}
+
+static int xive_irq_domain_alloc(struct irq_domain *domain, unsigned int virq,
+unsigned int nr_irqs, void *arg)
+{
+   struct irq_fwspec *fwspec = arg;
+   irq_hw_number_t hwirq;
+   unsigned int type = IRQ_TYPE_NONE;
+   int i, rc;
+
+   rc = xive_irq_domain_translate(domain, fwspec, &hwirq, &type);
+   if (rc)
+   return rc;
+
+   pr_debug("%s %d/%lx #%d\n", __func__, virq, hwirq, nr_irqs);
+
+   for (i = 0; i < nr_irqs; i++) {
+   /* TODO: call xive_irq_domain_map() */
+
+   /*
+* Mark interrupts as edge sensitive by default so that resend
+* actually works. Will fix that up below if needed.
+*/
+   irq_clear_status_flags(virq, IRQ_LEVEL);
+
+   /* allocates and sets handler data */
+   rc = xive_irq_alloc_data(virq + i, hwirq + i);
+   if (rc)
+   return rc;
+
+   irq_domain_set_hwirq_and_chip(domain, virq + i, hwirq + i,
+ &xive_irq_chip, domain->host_data);
+   irq_set_handler(virq + i, handle_fasteoi_irq);
+   }
+
+   return 0;
+}
+
+static void xive_irq_domain_free(struct irq_domain *domain,
+unsigned int virq, unsigned int nr_irqs)
+{
+   int i;
+
+   pr_debug("%s %d #%d\n", __func__, virq, nr_irqs);
+
+   for (i = 0; i < nr_irqs; i++)
+   xive_irq_free_data(virq + i);
+}
+#endif
+
 static const struct irq_domain_ops xive_irq_domain_ops = {
+#ifdef CONFIG_IRQ_DOMAIN_HIERARCHY
+   .alloc  = xive_irq_domain_alloc,
+   .free   = xive_irq_domain_free,
+   .translate = xive_irq_domain_translate,
+#endif
.match = xive_irq_domain_match,
.map = xive_irq_domain_map,
.unmap = xive_irq_domain_unmap,
-- 
2.31.1



[PATCH v2 21/32] powerpc/powernv/pci: Customize the MSI EOI handler to support PHB3

2021-07-01 Thread Cédric Le Goater
PHB3s need an extra OPAL call to EOI the interrupt. The call takes an
OPAL HW IRQ number but it is translated into a vector number in OPAL.
Here, we directly use the vector number of the in-the-middle "PNV-MSI"
domain instead of grabbing the OPAL HW IRQ number in the XICS parent
domain.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index e77caa4dbbdf..b498876a976f 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2169,12 +2169,33 @@ static void pnv_msi_compose_msg(struct irq_data *d, 
struct msi_msg *msg)
entry->msi_attrib.is_64 ? "64" : "32", d->hwirq, rc);
 }
 
+/*
+ * The IRQ data is mapped in the MSI domain in which HW IRQ numbers
+ * correspond to vector numbers.
+ */
+static void pnv_msi_eoi(struct irq_data *d)
+{
+   struct pci_controller *hose = irq_data_get_irq_chip_data(d);
+   struct pnv_phb *phb = hose->private_data;
+
+   if (phb->model == PNV_PHB_MODEL_PHB3) {
+   /*
+* The EOI OPAL call takes an OPAL HW IRQ number but
+* since it is translated into a vector number in
+* OPAL, use that directly.
+*/
+   WARN_ON_ONCE(opal_pci_msi_eoi(phb->opal_id, d->hwirq));
+   }
+
+   irq_chip_eoi_parent(d);
+}
+
 static struct irq_chip pnv_msi_irq_chip = {
.name   = "PNV-MSI",
.irq_shutdown   = pnv_msi_shutdown,
.irq_mask   = irq_chip_mask_parent,
.irq_unmask = irq_chip_unmask_parent,
-   .irq_eoi= irq_chip_eoi_parent,
+   .irq_eoi= pnv_msi_eoi,
.irq_set_affinity   = irq_chip_set_affinity_parent,
.irq_compose_msi_msg= pnv_msi_compose_msg,
 };
-- 
2.31.1



[PATCH v2 10/32] powerpc/pseries/pci: Add support of MSI domains to PHB hotplug

2021-07-01 Thread Cédric Le Goater
Simply allocate or release the MSI domains when a PHB is inserted in
or removed from the machine.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/pseries/pseries.h   |  1 +
 arch/powerpc/platforms/pseries/msi.c   | 10 ++
 arch/powerpc/platforms/pseries/pci_dlpar.c |  4 
 3 files changed, 15 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/pseries.h 
b/arch/powerpc/platforms/pseries/pseries.h
index d9280262588b..3544778e06d0 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -86,6 +86,7 @@ int pseries_root_bridge_prepare(struct pci_host_bridge 
*bridge);
 
 extern struct pci_controller_ops pseries_pci_controller_ops;
 int pseries_msi_allocate_domains(struct pci_controller *phb);
+void pseries_msi_free_domains(struct pci_controller *phb);
 
 unsigned long pseries_memory_block_size(void);
 
diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index f9635b01b2bf..e2127a3f7ebd 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -733,6 +733,16 @@ int pseries_msi_allocate_domains(struct pci_controller 
*phb)
return __pseries_msi_allocate_domains(phb, count);
 }
 
+void pseries_msi_free_domains(struct pci_controller *phb)
+{
+   if (phb->msi_domain)
+   irq_domain_remove(phb->msi_domain);
+   if (phb->dev_domain)
+   irq_domain_remove(phb->dev_domain);
+   if (phb->fwnode)
+   irq_domain_free_fwnode(phb->fwnode);
+}
+
 static void rtas_msi_pci_irq_fixup(struct pci_dev *pdev)
 {
/* No LSI -> leave MSIs (if any) configured */
diff --git a/arch/powerpc/platforms/pseries/pci_dlpar.c 
b/arch/powerpc/platforms/pseries/pci_dlpar.c
index a8f9140a24fa..90c9d3531694 100644
--- a/arch/powerpc/platforms/pseries/pci_dlpar.c
+++ b/arch/powerpc/platforms/pseries/pci_dlpar.c
@@ -33,6 +33,8 @@ struct pci_controller *init_phb_dynamic(struct device_node 
*dn)
 
pci_devs_phb_init_dynamic(phb);
 
+   pseries_msi_allocate_domains(phb);
+
/* Create EEH devices for the PHB */
eeh_phb_pe_create(phb);
 
@@ -74,6 +76,8 @@ int remove_phb_dynamic(struct pci_controller *phb)
}
}
 
+   pseries_msi_free_domains(phb);
+
/* Remove the PCI bus and unregister the bridge device from sysfs */
phb->bus = NULL;
pci_remove_bus(b);
-- 
2.31.1



[PATCH v2 06/32] powerpc/xive: Drop unmask of MSIs at startup

2021-07-01 Thread Cédric Le Goater
That was a workaround in the XIVE domain because of the lack of an MSI
domain. This is now handled by the MSI domain.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 10 --
 1 file changed, 10 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 2c907a4a2b05..a03057bfccfd 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -626,16 +626,6 @@ static unsigned int xive_irq_startup(struct irq_data *d)
pr_devel("xive_irq_startup: irq %d [0x%x] data @%p\n",
 d->irq, hw_irq, d);
 
-#ifdef CONFIG_PCI_MSI
-   /*
-* The generic MSI code returns with the interrupt disabled on the
-* card, using the MSI mask bits. Firmware doesn't appear to unmask
-* at that level, so we do it here by hand.
-*/
-   if (irq_data_get_msi_desc(d))
-   pci_msi_unmask_irq(d);
-#endif
-
/* Pick a target */
target = xive_pick_irq_target(d, irq_data_get_affinity_mask(d));
if (target == XIVE_INVALID_TARGET) {
-- 
2.31.1



[PATCH 0/2] powerpc/bpf: Fix issue with atomic ops

2021-07-01 Thread Naveen N. Rao
The first patch fixes an issue that causes a soft lockup on ppc64 with 
the BPF_ATOMIC bounds propagation verifier test. The second one updates 
ppc32 JIT to reject atomic operations properly.

- Naveen

Naveen N. Rao (2):
  powerpc/bpf: Fix detecting BPF atomic instructions
  powerpc/bpf: Reject atomic ops in ppc32 JIT

 arch/powerpc/net/bpf_jit_comp32.c | 14 +++---
 arch/powerpc/net/bpf_jit_comp64.c |  4 ++--
 2 files changed, 13 insertions(+), 5 deletions(-)


base-commit: 086d9878e1092e7e69a69676ee9ec792690abb1d
-- 
2.31.1



[PATCH 1/2] powerpc/bpf: Fix detecting BPF atomic instructions

2021-07-01 Thread Naveen N. Rao
Commit 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other
atomics in .imm") converted BPF_XADD to BPF_ATOMIC and added a way to
distinguish instructions based on the immediate field. Existing JIT
implementations were updated to check for the immediate field and to
reject programs utilizing anything more than BPF_ADD (such as BPF_FETCH)
in the immediate field.

However, the check added to powerpc64 JIT did not look at the correct
BPF instruction. Due to this, such programs would be accepted and
incorrectly JIT'ed resulting in soft lockups, as seen with the atomic
bounds test. Fix this by looking at the correct immediate value.

Fixes: 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other 
atomics in .imm")
Reported-by: Jiri Olsa 
Tested-by: Jiri Olsa 
Signed-off-by: Naveen N. Rao 
---
Hi Jiri,
FYI: I made a small change in this patch -- using 'imm' directly, rather 
than insn[i].imm. I've still added your Tested-by since this shouldn't 
impact the fix in any way.

- Naveen
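
For readers following along: with BPF_ATOMIC the operation is encoded
in the instruction's immediate field rather than in the opcode, so the
JIT must inspect insn->imm (cached in 'imm' by the JIT loop). A minimal
sketch of the distinction using the BPF_ATOMIC_OP() helper from
include/linux/filter.h -- both instructions below share the same opcode
and differ only in .imm:

	#include <linux/filter.h>

	/* code = BPF_STX | BPF_W | BPF_ATOMIC for both; .imm selects the op */
	struct bpf_insn xadd  = BPF_ATOMIC_OP(BPF_W, BPF_ADD,
					      BPF_REG_1, BPF_REG_2, 0);
	struct bpf_insn fetch = BPF_ATOMIC_OP(BPF_W, BPF_ADD | BPF_FETCH,
					      BPF_REG_1, BPF_REG_2, 0);

The pre-fix check read the immediate from the wrong instruction, so the
'fetch' form above could slip through and get JIT'ed incorrectly.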


 arch/powerpc/net/bpf_jit_comp64.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
b/arch/powerpc/net/bpf_jit_comp64.c
index 5cad5b5a7e9774..de8595880feec6 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -667,7 +667,7 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
 * BPF_STX ATOMIC (atomic ops)
 */
case BPF_STX | BPF_ATOMIC | BPF_W:
-   if (insn->imm != BPF_ADD) {
+   if (imm != BPF_ADD) {
pr_err_ratelimited(
"eBPF filter atomic op code %02x (@%d) 
unsupported\n",
code, i);
@@ -689,7 +689,7 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
PPC_BCC_SHORT(COND_NE, tmp_idx);
break;
case BPF_STX | BPF_ATOMIC | BPF_DW:
-   if (insn->imm != BPF_ADD) {
+   if (imm != BPF_ADD) {
pr_err_ratelimited(
"eBPF filter atomic op code %02x (@%d) 
unsupported\n",
code, i);
-- 
2.31.1



[PATCH 2/2] powerpc/bpf: Reject atomic ops in ppc32 JIT

2021-07-01 Thread Naveen N. Rao
Commit 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other
atomics in .imm") converted BPF_XADD to BPF_ATOMIC and updated all JIT
implementations to reject JIT'ing instructions with an immediate value
different from BPF_ADD. However, ppc32 BPF JIT was implemented around
the same time and didn't include the same change. Update the ppc32 JIT
accordingly.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/net/bpf_jit_comp32.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit_comp32.c 
b/arch/powerpc/net/bpf_jit_comp32.c
index cbe5b399ed869d..91c990335a16c9 100644
--- a/arch/powerpc/net/bpf_jit_comp32.c
+++ b/arch/powerpc/net/bpf_jit_comp32.c
@@ -773,9 +773,17 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
break;
 
/*
-* BPF_STX XADD (atomic_add)
+* BPF_STX ATOMIC (atomic ops)
 */
-   case BPF_STX | BPF_XADD | BPF_W: /* *(u32 *)(dst + off) += src 
*/
+   case BPF_STX | BPF_ATOMIC | BPF_W:
+   if (imm != BPF_ADD) {
+   pr_err_ratelimited(
+   "eBPF filter atomic op code %02x (@%d) 
unsupported\n", code, i);
+   return -ENOTSUPP;
+   }
+
+   /* *(u32 *)(dst + off) += src */
+
bpf_set_seen_register(ctx, tmp_reg);
/* Get offset into TMP_REG */
EMIT(PPC_RAW_LI(tmp_reg, off));
@@ -789,7 +797,7 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
PPC_BCC_SHORT(COND_NE, (ctx->idx - 3) * 4);
break;
 
-   case BPF_STX | BPF_XADD | BPF_DW: /* *(u64 *)(dst + off) += src 
*/
+   case BPF_STX | BPF_ATOMIC | BPF_DW: /* *(u64 *)(dst + off) += 
src */
return -EOPNOTSUPP;
 
/*
-- 
2.31.1



[PATCH] powerpc/xive: Fix error handling when allocating an IPI

2021-07-01 Thread Cédric Le Goater
This fixes the following smatch warning:

  arch/powerpc/sysdev/xive/common.c:1161 xive_request_ipi() warn: unsigned 
'xid->irq' is never less than zero.
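
The class of bug smatch is pointing at: xid->irq is an unsigned int, so
storing the (possibly negative) return value of irq_domain_alloc_irqs()
into it and only then comparing against zero can never catch the error.
A standalone illustration of the pattern, with hypothetical names:

	#include <stdio.h>

	static int alloc_irq(void)
	{
		return -12;	/* pretend the allocation failed (-ENOMEM) */
	}

	int main(void)
	{
		unsigned int irq = alloc_irq();	/* negative value wraps around */

		if (irq < 0)			/* always false: irq is unsigned */
			printf("error path, never reached\n");

		printf("irq = %u\n", irq);	/* prints 4294967284 */
		return 0;
	}

Hence the fix below: keep the return value in a signed 'ret', check it,
and only then assign it to xid->irq.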

Fixes: fd6db2892eba ("powerpc/xive: Modernize XIVE-IPI domain with an 'alloc' 
handler")
Cc: sta...@vger.kernel.org # v5.13
Reported-by: kernel test robot 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index f8ff558bc305..7bbb9bc83057 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1148,11 +1148,10 @@ static int __init xive_request_ipi(void)
 * Since the HW interrupt number doesn't have any meaning,
 * simply use the node number.
 */
-   xid->irq = irq_domain_alloc_irqs(ipi_domain, 1, node, &info);
-   if (xid->irq < 0) {
-   ret = xid->irq;
+   ret = irq_domain_alloc_irqs(ipi_domain, 1, node, &info);
+   if (ret < 0)
goto out_free_xive_ipis;
-   }
+   xid->irq = ret;
 
snprintf(xid->name, sizeof(xid->name), "IPI-%d", node);
 
-- 
2.31.1



Re: [PATCH 1/2] powerpc/bpf: Fix detecting BPF atomic instructions

2021-07-01 Thread Alexei Starovoitov
On Thu, Jul 1, 2021 at 8:09 AM Naveen N. Rao
 wrote:
>
> Commit 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other
> atomics in .imm") converted BPF_XADD to BPF_ATOMIC and added a way to
> distinguish instructions based on the immediate field. Existing JIT
> implementations were updated to check for the immediate field and to
> reject programs utilizing anything more than BPF_ADD (such as BPF_FETCH)
> in the immediate field.
>
> However, the check added to powerpc64 JIT did not look at the correct
> BPF instruction. Due to this, such programs would be accepted and
> incorrectly JIT'ed resulting in soft lockups, as seen with the atomic
> bounds test. Fix this by looking at the correct immediate value.
>
> Fixes: 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other 
> atomics in .imm")
> Reported-by: Jiri Olsa 
> Tested-by: Jiri Olsa 
> Signed-off-by: Naveen N. Rao 
> ---
> Hi Jiri,
> FYI: I made a small change in this patch -- using 'imm' directly, rather
> than insn[i].imm. I've still added your Tested-by since this shouldn't
> impact the fix in any way.
>
> - Naveen

Excellent debugging! You guys are awesome.
How do you want this fix routed? via bpf tree?


Re: [PATCH 2/2] powerpc/bpf: Reject atomic ops in ppc32 JIT

2021-07-01 Thread Christophe Leroy




On 01/07/2021 at 17:08, Naveen N. Rao wrote:

Commit 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other
atomics in .imm") converted BPF_XADD to BPF_ATOMIC and updated all JIT
implementations to reject JIT'ing instructions with an immediate value
different from BPF_ADD. However, ppc32 BPF JIT was implemented around
the same time and didn't include the same change. Update the ppc32 JIT
accordingly.

Signed-off-by: Naveen N. Rao 


Shouldn't it also include a Fixes tag and a stable Cc, as PPC32 eBPF was
added in 5.13?

Fixes: 51c66ad849a7 ("powerpc/bpf: Implement extended BPF on PPC32")
Cc: sta...@vger.kernel.org


---
  arch/powerpc/net/bpf_jit_comp32.c | 14 +++---
  1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit_comp32.c 
b/arch/powerpc/net/bpf_jit_comp32.c
index cbe5b399ed869d..91c990335a16c9 100644
--- a/arch/powerpc/net/bpf_jit_comp32.c
+++ b/arch/powerpc/net/bpf_jit_comp32.c
@@ -773,9 +773,17 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
break;
  
  		/*

-* BPF_STX XADD (atomic_add)
+* BPF_STX ATOMIC (atomic ops)
 */
-   case BPF_STX | BPF_XADD | BPF_W: /* *(u32 *)(dst + off) += src 
*/
+   case BPF_STX | BPF_ATOMIC | BPF_W:
+   if (imm != BPF_ADD) {
+   pr_err_ratelimited(
+   "eBPF filter atomic op code %02x (@%d) 
unsupported\n", code, i);
+   return -ENOTSUPP;
+   }
+
+   /* *(u32 *)(dst + off) += src */
+
bpf_set_seen_register(ctx, tmp_reg);
/* Get offset into TMP_REG */
EMIT(PPC_RAW_LI(tmp_reg, off));
@@ -789,7 +797,7 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
PPC_BCC_SHORT(COND_NE, (ctx->idx - 3) * 4);
break;
  
-		case BPF_STX | BPF_XADD | BPF_DW: /* *(u64 *)(dst + off) += src */

+   case BPF_STX | BPF_ATOMIC | BPF_DW: /* *(u64 *)(dst + off) += 
src */
return -EOPNOTSUPP;
  
  		/*




Re: [PATCH v2] sched: Use BUG_ON

2021-07-01 Thread Jeremy Kerr
Hi Jason,

> The BUG_ON() macro combines an if condition followed by BUG() into a
> single statement, so use BUG_ON() instead of the open-coded form.

[...]

> -   if (spu_acquire(ctx))
> -   BUG();  /* a kernel thread never has signals pending */
> +   /* a kernel thread never has signals pending */
> +   BUG_ON(spu_acquire(ctx));

I'm not convinced that this is an improvement; you've combined the
acquire and the BUG into a single statement, and now it's no longer
clear what the comment applies to.

If you really wanted to use BUG_ON, something like this would be more
clear:

rc = spu_acquire(ctx);
/* a kernel thread never has signals pending */
BUG_ON(rc);

but we don't have a suitable rc variable handy, so we'd need one of
those declared too. You could avoid that with:

if (spu_acquire(ctx))
BUG_ON(1); /* a kernel thread never has signals pending */

but wait: no need for the constant there, so this would be better:

if (spu_acquire(ctx))
BUG(); /* a kernel thread never has signals pending */

wait, what are we doing again?

To me, this is a bit of shuffling code around, for no real benefit.

Regards,


Jeremy



Re: [PATCH] powerpc/mm: Fix lockup on kernel exec fault

2021-07-01 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of July 1, 2021 9:17 pm:
> The powerpc kernel is not prepared to handle exec faults from kernel.
> Especially, the function is_exec_fault() will return 'false' when an
> exec fault is taken by kernel, because the check is based on reading
> current->thread.regs->trap which contains the trap from user.
> 
> For instance, when provoking a LKDTM EXEC_USERSPACE test,
> current->thread.regs->trap is set to SYSCALL trap (0xc00), and
> the fault taken by the kernel is not seen as an exec fault by
> set_access_flags_filter().
> 
> Commit d7df2443cd5f ("powerpc/mm: Fix spurrious segfaults on radix
> with autonuma") made it clear and handled it properly. But later on
> commit d3ca587404b3 ("powerpc/mm: Fix reporting of kernel execute
> faults") removed that handling, introducing test based on error_code.
> And here is the problem, because on the 603 all upper bits of SRR1
> get cleared when the TLB instruction miss handler bails out to ISI.

So the problem is 603 doesn't see the DSISR_NOEXEC_OR_G bit?

I don't see the problem with this for 64s, I don't think anything sane
can be done for any 0x400 interrupt in the kernel so it's probably
good to catch all here just in case. For 64s,

Acked-by: Nicholas Piggin 

Why is 32s clearing those top bits? And it seems to be setting DSISR
that AFAIKS it does not use. Seems like it would be good to add a
NOEXEC_OR_G bit into SRR1.

Thanks,
Nick


> Until commit cbd7e6ca0210 ("powerpc/fault: Avoid heavy
> search_exception_tables() verification"), an exec fault from kernel
> at a userspace address was indirectly caught by the lack of entry for
> that address in the exception tables. But after that commit the
> kernel mainly relies on KUAP or on core mm handling to catch wrong
> user accesses. Here the access is not wrong, so mm handles it.
> It is a minor fault because PAGE_EXEC is not set,
> set_access_flags_filter() should set PAGE_EXEC and voila.
> But as is_exec_fault() returns false as explained in the beginning,
> set_access_flags_filter() bails out without setting the PAGE_EXEC flag,
> which leads to a never-ending minor exec fault.
> 
> As the kernel is not prepared to handle such exec faults, the thing
> to do is to fire in bad_kernel_fault() for any exec fault taken by
> the kernel, as it was prior to commit d3ca587404b3.
> 
> Fixes: d3ca587404b3 ("powerpc/mm: Fix reporting of kernel execute faults")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/mm/fault.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 34f641d4a2fe..a8d0ce85d39a 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -199,9 +199,7 @@ static bool bad_kernel_fault(struct pt_regs *regs, 
> unsigned long error_code,
>  {
>   int is_exec = TRAP(regs) == INTERRUPT_INST_STORAGE;
>  
> - /* NX faults set DSISR_PROTFAULT on the 8xx, DSISR_NOEXEC_OR_G on 
> others */
> - if (is_exec && (error_code & (DSISR_NOEXEC_OR_G | DSISR_KEYFAULT |
> -   DSISR_PROTFAULT))) {
> + if (is_exec) {
>   pr_crit_ratelimited("kernel tried to execute %s page (%lx) - 
> exploit attempt? (uid: %d)\n",
>   address >= TASK_SIZE ? "exec-protected" : 
> "user",
>   address,
> -- 
> 2.25.0
> 
> 


Re: [PATCH] Documentation: PCI: pci-error-recovery: rearrange the general sequence

2021-07-01 Thread Bjorn Helgaas
Please make the subject a little more specific.  "rearrange the
general sequence" doesn't say anything about what was affected.

On Fri, Jun 18, 2021 at 02:04:46PM +0800, Wesley Sheng wrote:
> Reset_link() callback function was called before mmio_enabled() in
> pcie_do_recovery() function actually, so rearrange the general
> sequence betwen step 2 and step 3 accordingly.

s/betwen/between/

Not sure "general" adds anything in this sentence.  "Step 2 and step
3" are not meaningful here in the commit log.  It needs to spell out
what those steps are so the log makes sense by itself.

"reset_link" does not appear in pcie_do_recovery().  I'm guessing
you're referring to the "reset_subordinates" function pointer?

> Signed-off-by: Wesley Sheng 

I didn't quite understand your response to Oliver, so I'll wait for
your corrections and his ack before proceeding.

> ---
>  Documentation/PCI/pci-error-recovery.rst | 23 ---
>  1 file changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/Documentation/PCI/pci-error-recovery.rst 
> b/Documentation/PCI/pci-error-recovery.rst
> index 187f43a03200..ac6a8729ef28 100644
> --- a/Documentation/PCI/pci-error-recovery.rst
> +++ b/Documentation/PCI/pci-error-recovery.rst
> @@ -184,7 +184,14 @@ is STEP 6 (Permanent Failure).
> and prints an error to syslog.  A reboot is then required to
> get the device working again.
>  
> -STEP 2: MMIO Enabled
> +STEP 2: Link Reset
> +--
> +The platform resets the link.  This is a PCI-Express specific step
> +and is done whenever a fatal error has been detected that can be
> +"solved" by resetting the link.
> +
> +
> +STEP 3: MMIO Enabled
>  
>  The platform re-enables MMIO to the device (but typically not the
>  DMA), and then calls the mmio_enabled() callback on all affected
> @@ -197,8 +204,8 @@ information, if any, and eventually do things like 
> trigger a device local
>  reset or some such, but not restart operations. This callback is made if
>  all drivers on a segment agree that they can try to recover and if no 
> automatic
>  link reset was performed by the HW. If the platform can't just re-enable IOs
> -without a slot reset or a link reset, it will not call this callback, and
> -instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
> +without a slot reset, it will not call this callback, and
> +instead will have gone directly or STEP 4 (Slot Reset)

s/or/to/  ?

>  .. note::
>  
> @@ -210,7 +217,7 @@ instead will have gone directly to STEP 3 (Link Reset) or 
> STEP 4 (Slot Reset)
> such an error might cause IOs to be re-blocked for the whole
> segment, and thus invalidate the recovery that other devices
> on the same segment might have done, forcing the whole segment
> -   into one of the next states, that is, link reset or slot reset.
> +   into next states, that is, slot reset.

s/into next states/into the next state/ ?

>  The driver should return one of the following result codes:
>- PCI_ERS_RESULT_RECOVERED
> @@ -233,17 +240,11 @@ The driver should return one of the following result 
> codes:
>  
>  The next step taken depends on the results returned by the drivers.
>  If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
> -proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
> +proceeds to STEP 5 (Resume Operations).
>  
>  If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
>  proceeds to STEP 4 (Slot Reset)
>  
> -STEP 3: Link Reset
> ---
> -The platform resets the link.  This is a PCI-Express specific step
> -and is done whenever a fatal error has been detected that can be
> -"solved" by resetting the link.
> -
>  STEP 4: Slot Reset
>  --
>  
> -- 
> 2.25.1
> 


[PATCH v3 1/5] nvmem: nintendo-otp: Add new driver for the Wii and Wii U OTP

2021-07-01 Thread Emmanuel Gil Peyrot
This OTP is read-only and contains various keys used by the console to
decrypt, encrypt or verify various pieces of storage.

Its size depends on the console: it is 128 bytes on the Wii and
1024 bytes on the Wii U (split into eight 128-byte banks).

It can be used directly by writing into one register and reading from
the other one, without any additional synchronisation.

This driver was written based on reversed documentation, see:
https://wiiubrew.org/wiki/Hardware/OTP

Signed-off-by: Emmanuel Gil Peyrot 
Tested-by: Jonathan Neuschäfer   # on Wii
Tested-by: Emmanuel Gil Peyrot   # on Wii U
---
 drivers/nvmem/Kconfig|  11 
 drivers/nvmem/Makefile   |   2 +
 drivers/nvmem/nintendo-otp.c | 124 +++
 3 files changed, 137 insertions(+)
 create mode 100644 drivers/nvmem/nintendo-otp.c

diff --git a/drivers/nvmem/Kconfig b/drivers/nvmem/Kconfig
index dd2019006838..39854d43758b 100644
--- a/drivers/nvmem/Kconfig
+++ b/drivers/nvmem/Kconfig
@@ -107,6 +107,17 @@ config MTK_EFUSE
  This driver can also be built as a module. If so, the module
  will be called efuse-mtk.
 
+config NVMEM_NINTENDO_OTP
+   tristate "Nintendo Wii and Wii U OTP Support"
+   help
+ This is a driver exposing the OTP of a Nintendo Wii or Wii U console.
+
+ This memory contains common and per-console keys, signatures and
+ related data required to access peripherals.
+
+ This driver can also be built as a module. If so, the module
+ will be called nvmem-nintendo-otp.
+
 config QCOM_QFPROM
tristate "QCOM QFPROM Support"
depends on ARCH_QCOM || COMPILE_TEST
diff --git a/drivers/nvmem/Makefile b/drivers/nvmem/Makefile
index bbea1410240a..dcbbde35b6a8 100644
--- a/drivers/nvmem/Makefile
+++ b/drivers/nvmem/Makefile
@@ -23,6 +23,8 @@ obj-$(CONFIG_NVMEM_LPC18XX_OTP)   += nvmem_lpc18xx_otp.o
 nvmem_lpc18xx_otp-y:= lpc18xx_otp.o
 obj-$(CONFIG_NVMEM_MXS_OCOTP)  += nvmem-mxs-ocotp.o
 nvmem-mxs-ocotp-y  := mxs-ocotp.o
+obj-$(CONFIG_NVMEM_NINTENDO_OTP)   += nvmem-nintendo-otp.o
+nvmem-nintendo-otp-y   := nintendo-otp.o
 obj-$(CONFIG_MTK_EFUSE)+= nvmem_mtk-efuse.o
 nvmem_mtk-efuse-y  := mtk-efuse.o
 obj-$(CONFIG_QCOM_QFPROM)  += nvmem_qfprom.o
diff --git a/drivers/nvmem/nintendo-otp.c b/drivers/nvmem/nintendo-otp.c
new file mode 100644
index ..33961b17f9f1
--- /dev/null
+++ b/drivers/nvmem/nintendo-otp.c
@@ -0,0 +1,124 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Nintendo Wii and Wii U OTP driver
+ *
+ * This is a driver exposing the OTP of a Nintendo Wii or Wii U console.
+ *
+ * This memory contains common and per-console keys, signatures and
+ * related data required to access peripherals.
+ *
+ * Based on reversed documentation from https://wiiubrew.org/wiki/Hardware/OTP
+ *
+ * Copyright (C) 2021 Emmanuel Gil Peyrot 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define HW_OTPCMD  0
+#define HW_OTPDATA 4
+#define OTP_READ   0x8000
+#define BANK_SIZE  128
+#define WORD_SIZE  4
+
+struct nintendo_otp_priv {
+   void __iomem *regs;
+};
+
+struct nintendo_otp_devtype_data {
+   const char *name;
+   unsigned int num_banks;
+};
+
+static const struct nintendo_otp_devtype_data hollywood_otp_data = {
+   .name = "wii-otp",
+   .num_banks = 1,
+};
+
+static const struct nintendo_otp_devtype_data latte_otp_data = {
+   .name = "wiiu-otp",
+   .num_banks = 8,
+};
+
+static int nintendo_otp_reg_read(void *context,
+unsigned int reg, void *_val, size_t bytes)
+{
+   struct nintendo_otp_priv *priv = context;
+   u32 *val = _val;
+   int words = bytes / WORD_SIZE;
+   u32 bank, addr;
+
+   while (words--) {
+   bank = (reg / BANK_SIZE) << 8;
+   addr = (reg / WORD_SIZE) % (BANK_SIZE / WORD_SIZE);
+   iowrite32be(OTP_READ | bank | addr, priv->regs + HW_OTPCMD);
+   *val++ = ioread32be(priv->regs + HW_OTPDATA);
+   reg += WORD_SIZE;
+   }
+
+   return 0;
+}
+
+static const struct of_device_id nintendo_otp_of_table[] = {
+   { .compatible = "nintendo,hollywood-otp", .data = _otp_data },
+   { .compatible = "nintendo,latte-otp", .data = _otp_data },
+   {/* sentinel */},
+};
+MODULE_DEVICE_TABLE(of, nintendo_otp_of_table);
+
+static int nintendo_otp_probe(struct platform_device *pdev)
+{
+   struct device *dev = &pdev->dev;
+   const struct of_device_id *of_id =
+   of_match_device(nintendo_otp_of_table, dev);
+   struct resource *res;
+   struct nvmem_device *nvmem;
+   struct nintendo_otp_priv *priv;
+
+   struct nvmem_config config = {
+   .stride = WORD_SIZE,
+   .word_size = WORD_SIZE,
+   .reg_read = nintendo_otp_reg_read,
+   .read_only = true,
+   

[PATCH v3 0/5] nvmem: nintendo-otp: Add new driver for the Wii and Wii U OTP

2021-07-01 Thread Emmanuel Gil Peyrot
The OTP is a read-only memory area which contains various keys and
signatures used to decrypt, encrypt or verify various pieces of storage.

Its size depends on the console: it is 128 bytes on the Wii and
1024 bytes on the Wii U (split into eight 128-byte banks).

It can be used directly by writing into one register and reading from
the other one, without any additional synchronisation.

This series has been tested on both the Wii U (using my downstream
master-wiiu branch[1]), as well as on the Wii on mainline.

[1] https://gitlab.com/linkmauve/linux-wiiu/-/commits/master-wiiu

Changes since v1:
- Fixed the commit messages so they can be accepted by other email
  servers, sorry about that.

Changes since v2:
- Switched the dt binding documentation to YAML.
- Used more obvious register arithmetic, and tested that gcc (at -O1 and
  above) outputs the exact same rlwinm instructions for them.
- Use more #defines to make the code easier to read.
- Include some links to the reversed documentation.
- Avoid overlapping dt regions by changing the existing control@d800100
  node to end before the OTP registers, with some bigger dt refactoring
  left for a future series.

Emmanuel Gil Peyrot (5):
  nvmem: nintendo-otp: Add new driver for the Wii and Wii U OTP
  dt-bindings: nintendo-otp: Document the Wii and Wii U OTP support
  powerpc: wii.dts: Reduce the size of the control area
  powerpc: wii.dts: Expose the OTP on this platform
  powerpc: wii_defconfig: Enable OTP by default

 .../bindings/nvmem/nintendo-otp.yaml  |  44 +++
 arch/powerpc/boot/dts/wii.dts |  13 +-
 arch/powerpc/configs/wii_defconfig|   1 +
 drivers/nvmem/Kconfig |  11 ++
 drivers/nvmem/Makefile|   2 +
 drivers/nvmem/nintendo-otp.c  | 124 ++
 6 files changed, 194 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/devicetree/bindings/nvmem/nintendo-otp.yaml
 create mode 100644 drivers/nvmem/nintendo-otp.c

-- 
2.32.0



[PATCH v3 2/5] dt-bindings: nintendo-otp: Document the Wii and Wii U OTP support

2021-07-01 Thread Emmanuel Gil Peyrot
Both of these consoles use the exact same two registers, even at the
same address, but the Wii U has eight 128-byte banks of memory while
the Wii only has one, hence the two compatible strings.

Signed-off-by: Emmanuel Gil Peyrot 
---
 .../bindings/nvmem/nintendo-otp.yaml  | 44 +++
 1 file changed, 44 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/nvmem/nintendo-otp.yaml

diff --git a/Documentation/devicetree/bindings/nvmem/nintendo-otp.yaml 
b/Documentation/devicetree/bindings/nvmem/nintendo-otp.yaml
new file mode 100644
index ..c39bd64b03b9
--- /dev/null
+++ b/Documentation/devicetree/bindings/nvmem/nintendo-otp.yaml
@@ -0,0 +1,44 @@
+# SPDX-License-Identifier: GPL-2.0
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/nvmem/nintendo-otp.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Nintendo Wii and Wii U OTP Device Tree Bindings
+
+description: |
+  This binding represents the OTP memory as found on a Nintendo Wii or Wii U,
+  which contains common and per-console keys, signatures and related data
+  required to access peripherals.
+
+  See https://wiiubrew.org/wiki/Hardware/OTP
+
+maintainers:
+  - Emmanuel Gil Peyrot 
+
+allOf:
+  - $ref: "nvmem.yaml#"
+
+properties:
+  compatible:
+enum:
+  - nintendo,hollywood-otp
+  - nintendo,latte-otp
+
+  reg:
+maxItems: 1
+
+required:
+  - compatible
+  - reg
+
+unevaluatedProperties: false
+
+examples:
+  - |
+otp@d8001ec {
+compatible = "nintendo,latte-otp";
+reg = <0x0d8001ec 0x8>;
+};
+
+...
-- 
2.32.0



[PATCH v3 3/5] powerpc: wii.dts: Reduce the size of the control area

2021-07-01 Thread Emmanuel Gil Peyrot
This is wrong, but needed in order to avoid overlapping ranges with the
OTP area added in the next commit.  A refactor of this part of the
device tree is needed: according to Wiibrew[1], this area starts at
0x0d800000 and spans 0x400 bytes (that is, 0x100 32-bit registers),
encompassing PIC and GPIO registers, amongst the ones already exposed in
this device tree, which should become children of the control@d800000
node.

[1] https://wiibrew.org/wiki/Hardware/Hollywood_Registers

Signed-off-by: Emmanuel Gil Peyrot 
---
 arch/powerpc/boot/dts/wii.dts | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/boot/dts/wii.dts b/arch/powerpc/boot/dts/wii.dts
index aaa381da1906..c5fb54f8cc02 100644
--- a/arch/powerpc/boot/dts/wii.dts
+++ b/arch/powerpc/boot/dts/wii.dts
@@ -216,7 +216,13 @@ AVE: audio-video-encoder@70 {
 
control@d800100 {
compatible = "nintendo,hollywood-control";
-   reg = <0x0d800100 0x300>;
+   /*
+* Both the address and length are wrong, according to
+* Wiibrew this should be <0x0d800000 0x400>, but it
+* requires refactoring the PIC1 and GPIO nodes before
+* changing that.
+*/
+   reg = <0x0d800100 0xa0>;
};
 
disk@d806000 {
-- 
2.32.0



[PATCH v3 4/5] powerpc: wii.dts: Expose the OTP on this platform

2021-07-01 Thread Emmanuel Gil Peyrot
This can be used by the newly-added nintendo-otp nvmem module.

Signed-off-by: Emmanuel Gil Peyrot 
---
 arch/powerpc/boot/dts/wii.dts | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/boot/dts/wii.dts b/arch/powerpc/boot/dts/wii.dts
index c5fb54f8cc02..e9c945b123c6 100644
--- a/arch/powerpc/boot/dts/wii.dts
+++ b/arch/powerpc/boot/dts/wii.dts
@@ -219,12 +219,17 @@ control@d800100 {
/*
 * Both the address and length are wrong, according to
 * Wiibrew this should be <0x0d800000 0x400>, but it
-* requires refactoring the PIC1 and GPIO nodes before
-* changing that.
+* requires refactoring the PIC1, GPIO and OTP nodes
+* before changing that.
 */
reg = <0x0d800100 0xa0>;
};
 
+   otp@d8001ec {
+   compatible = "nintendo,hollywood-otp";
+   reg = <0x0d8001ec 0x8>;
+   };
+
disk@d806000 {
compatible = "nintendo,hollywood-di";
reg = <0x0d806000 0x40>;
-- 
2.32.0



[PATCH v3 5/5] powerpc: wii_defconfig: Enable OTP by default

2021-07-01 Thread Emmanuel Gil Peyrot
This selects the nintendo-otp module when building for this platform.

Signed-off-by: Emmanuel Gil Peyrot 
---
 arch/powerpc/configs/wii_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/configs/wii_defconfig 
b/arch/powerpc/configs/wii_defconfig
index 379c171f3ddd..a0c45bf2bfb1 100644
--- a/arch/powerpc/configs/wii_defconfig
+++ b/arch/powerpc/configs/wii_defconfig
@@ -99,6 +99,7 @@ CONFIG_LEDS_TRIGGER_HEARTBEAT=y
 CONFIG_LEDS_TRIGGER_PANIC=y
 CONFIG_RTC_CLASS=y
 CONFIG_RTC_DRV_GENERIC=y
+CONFIG_NVMEM_NINTENDO_OTP=y
 CONFIG_EXT2_FS=y
 CONFIG_EXT4_FS=y
 CONFIG_FUSE_FS=m
-- 
2.32.0



Re: [RFC PATCH 10/43] powerpc/64s: Always set PMU control registers to frozen/disabled when not in use

2021-07-01 Thread Nicholas Piggin
Excerpts from Madhavan Srinivasan's message of July 1, 2021 11:17 pm:
> 
> On 6/22/21 4:27 PM, Nicholas Piggin wrote:
>> KVM PMU management code looks for particular frozen/disabled bits in
>> the PMU registers so it knows whether it must clear them when coming
>> out of a guest or not. Setting this up helps KVM make these optimisations
>> without getting confused. Longer term the better approach might be to
>> move guest/host PMU switching to the perf subsystem.
>>
>> Signed-off-by: Nicholas Piggin 
>> ---
>>   arch/powerpc/kernel/cpu_setup_power.c | 4 ++--
>>   arch/powerpc/kernel/dt_cpu_ftrs.c | 6 +++---
>>   arch/powerpc/kvm/book3s_hv.c  | 5 +
>>   arch/powerpc/perf/core-book3s.c   | 7 +++
>>   4 files changed, 17 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/cpu_setup_power.c 
>> b/arch/powerpc/kernel/cpu_setup_power.c
>> index a29dc8326622..3dc61e203f37 100644
>> --- a/arch/powerpc/kernel/cpu_setup_power.c
>> +++ b/arch/powerpc/kernel/cpu_setup_power.c
>> @@ -109,7 +109,7 @@ static void init_PMU_HV_ISA207(void)
>>   static void init_PMU(void)
>>   {
>>  mtspr(SPRN_MMCRA, 0);
>> -mtspr(SPRN_MMCR0, 0);
>> +mtspr(SPRN_MMCR0, MMCR0_FC);
> 
> The sticky point here is that currently, if not frozen, pmc5/6 will
> keep counting. And not freezing them at boot is quite useful
> sometimes, like say when running in a simulation where we could
> calculate approximate CPIs for micro-benchmarks without the perf subsystem.
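
(For context, the measurement described above relies on PMC5 and PMC6,
which on these processors count instructions completed and cycles
respectively, and keep running as long as they are not frozen. A rough
sketch of the idea -- a hypothetical helper assuming kernel context and
the SPR definitions from asm/reg.h, not a definitive implementation:

	#include <asm/reg.h>

	/* Approximate CPI (scaled by 100) from the free-running
	 * counters: PMC5 counts instructions completed, PMC6 counts
	 * cycles.
	 */
	static unsigned long sample_cpi_x100(void)
	{
		unsigned long insns  = mfspr(SPRN_PMC5);
		unsigned long cycles = mfspr(SPRN_PMC6);

		return insns ? (cycles * 100) / insns : 0;
	}

With MMCR0[FC] set at init, as this patch does, both counters stay
frozen until something explicitly unfreezes them.)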

Can you not even use the sysfs files in this sim environment? If not,
what if we added a boot option that could set some things up? That way
you could possibly even gather some more types of events too.

Thanks,
Nick