Re: [PATCH] powerpc/mm: Fix lockup on kernel exec fault

2021-07-01 Thread Christophe Leroy




On 02/07/2021 at 03:25, Nicholas Piggin wrote:

Excerpts from Christophe Leroy's message of July 1, 2021 9:17 pm:

The powerpc kernel is not prepared to handle exec faults from kernel.
Especially, the function is_exec_fault() will return 'false' when an
exec fault is taken by kernel, because the check is based on reading
current->thread.regs->trap which contains the trap from user.

For instance, when provoking a LKDTM EXEC_USERSPACE test,
current->thread.regs->trap is set to SYSCALL trap (0xc00), and
the fault taken by the kernel is not seen as an exec fault by
set_access_flags_filter().

Commit d7df2443cd5f ("powerpc/mm: Fix spurrious segfaults on radix
with autonuma") made it clear and handled it properly. But later on
commit d3ca587404b3 ("powerpc/mm: Fix reporting of kernel execute
faults") removed that handling, introducing test based on error_code.
And here is the problem, because on the 603 all upper bits of SRR1
get cleared when the TLB instruction miss handler bails out to ISI.


So the problem is 603 doesn't see the DSISR_NOEXEC_OR_G bit?


In a way, yes. But the problem is also that the kernel doesn't see it as an exec fault in 
set_access_flags_filter(), as explained above. If it could see it as an exec fault, it would set 
PAGE_EXEC and it would work (or maybe not, because it seems it also checks the dirtiness of the 
page, and here the page is also flagged as dirty).
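
For reference, a minimal sketch of the check being discussed (paraphrased from 
arch/powerpc/mm/pgtable.c; treat the exact body as an approximation):

	static inline int is_exec_fault(void)
	{
		/* current->thread.regs holds the *user* registers, so a
		 * fault taken from kernel context never matches the 0x400
		 * (ISI) trap value and is never seen as an exec fault. */
		return current->thread.regs && TRAP(current->thread.regs) == 0x400;
	}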


603 will see DSISR_NOEXEC_OR_G if it's an access to a page which is in a 
segment flagged NX.



I don't see the problem with this for 64s, I don't think anything sane
can be done for any 0x400 interrupt in the kernel so it's probably
good to catch all here just in case. For 64s,

Acked-by: Nicholas Piggin 

Why is 32s clearing those top bits? And it seems to be setting DSISR
that AFAIKS it does not use. Seems like it would be good to add a
NOEXEC_OR_G bit into SRR1.


Probably for simplicity.

When taking the Instruction TLB Miss interrupt, SRR1 contains CR0 fields in bits 0-3 and some 
dedicated info in bits 12-15. That doesn't match the SRR1 bits for ISI, so before falling back to the 
ISI handler, the ITLB Miss handler error path clears the upper SRR1 bits.


Maybe it could instead try to set the right bits, but that would make it more complicated because the 
error path can be taken for the following reasons:

- No page table
- Not PAGE_PRESENT
- Not PAGE_ACCESSED
- Not PAGE_EXEC
- Below TASK_SIZE and not PAGE_USER

At the moment the verification of the flags is done with a single 'andc' operation (see the sketch 
below). If we wanted to set the proper bits, it would mean testing the flags separately, which would 
impact performance on the no-error path.
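
As an illustration, the single-test approach looks roughly like this (a simplified C sketch, not 
the actual 603 assembly; the flag names are the usual book3s/32 PTE bits):

	/* All flags that must be set for the access to be allowed. */
	unsigned long required = _PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_EXEC;

	/* One 'andc'-style operation: nonzero iff at least one required
	 * flag is missing. It cannot tell *which* flag failed, so no
	 * precise SRR1/DSISR reason bit can be derived without extra
	 * per-flag tests on the hot path. */
	if (required & ~pte_val(pte))
		goto bail_to_isi;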


Or maybe it would be good enough to set the PROTFAULT bit in all cases except the lack of a page table. 
The 8xx sets PROTFAULT when hitting non-exec pages, so the kernel is prepared for it anyway. Not sure 
about the lack of PAGE_PRESENT though; the 8xx sets the NOHPTE bit when PAGE_PRESENT is cleared.


But is it really worth doing?

Christophe


Re: [PATCH v15 12/12] of: Add plumbing for restricted DMA pool

2021-07-01 Thread Guenter Roeck
Hi,

On Thu, Jun 24, 2021 at 11:55:26PM +0800, Claire Chang wrote:
> If a device is not behind an IOMMU, we look up the device node and set
> up the restricted DMA when the restricted-dma-pool is presented.
> 
> Signed-off-by: Claire Chang 
> Tested-by: Stefano Stabellini 
> Tested-by: Will Deacon 

With this patch in place, all sparc and sparc64 qemu emulations
fail to boot. Symptom is that the root file system is not found.
Reverting this patch fixes the problem. Bisect log is attached.

Guenter

---
# bad: [fb0ca446157a86b75502c1636b0d81e642fe6bf1] Add linux-next specific files 
for 20210701
# good: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
git bisect start 'HEAD' 'v5.13'
# bad: [f63c4fda987a19b1194cc45cb72fd5bf968d9d90] Merge remote-tracking branch 
'rdma/for-next'
git bisect bad f63c4fda987a19b1194cc45cb72fd5bf968d9d90
# good: [46bb5dd1d2a63e906e374e97dfd4a5e33934b1c4] Merge remote-tracking branch 
'ipsec/master'
git bisect good 46bb5dd1d2a63e906e374e97dfd4a5e33934b1c4
# good: [43ba6969cfb8185353a7a6fc79070f13b9e3d6d3] Merge remote-tracking branch 
'clk/clk-next'
git bisect good 43ba6969cfb8185353a7a6fc79070f13b9e3d6d3
# good: [1ca5eddcf8dca1d6345471c6404e7364af0d7019] Merge remote-tracking branch 
'fuse/for-next'
git bisect good 1ca5eddcf8dca1d6345471c6404e7364af0d7019
# good: [8f6d7b3248705920187263a4e7147b0752ec7dcf] Merge remote-tracking branch 
'pci/next'
git bisect good 8f6d7b3248705920187263a4e7147b0752ec7dcf
# good: [df1885a755784da3ef285f36d9230c1d090ef186] RDMA/rtrs_clt: Alloc less 
memory with write path fast memory registration
git bisect good df1885a755784da3ef285f36d9230c1d090ef186
# good: [93d31efb58c8ad4a66bbedbc2d082df458c04e45] Merge remote-tracking branch 
'cpufreq-arm/cpufreq/arm/linux-next'
git bisect good 93d31efb58c8ad4a66bbedbc2d082df458c04e45
# good: [46308965ae6fdc7c25deb2e8c048510ae51bbe66] RDMA/irdma: Check contents 
of user-space irdma_mem_reg_req object
git bisect good 46308965ae6fdc7c25deb2e8c048510ae51bbe66
# good: [6de7a1d006ea9db235492b288312838d6878385f] 
thermal/drivers/int340x/processor_thermal: Split enumeration and processing part
git bisect good 6de7a1d006ea9db235492b288312838d6878385f
# good: [081bec2577cda3d04f6559c60b6f4e2242853520] dt-bindings: of: Add 
restricted DMA pool
git bisect good 081bec2577cda3d04f6559c60b6f4e2242853520
# good: [bf95ac0bcd69979af146852f6a617a60285ebbc1] Merge remote-tracking branch 
'thermal/thermal/linux-next'
git bisect good bf95ac0bcd69979af146852f6a617a60285ebbc1
# good: [3d8287544223a3d2f37981c1f9ffd94d0b5e9ffc] RDMA/core: Always release 
restrack object
git bisect good 3d8287544223a3d2f37981c1f9ffd94d0b5e9ffc
# bad: [cff1f23fad6e0bd7d671acce0d15285c709f259c] Merge remote-tracking branch 
'swiotlb/linux-next'
git bisect bad cff1f23fad6e0bd7d671acce0d15285c709f259c
# bad: [b655006619b7bccd0dc1e055bd72de5d613e7b5c] of: Add plumbing for 
restricted DMA pool
git bisect bad b655006619b7bccd0dc1e055bd72de5d613e7b5c
# first bad commit: [b655006619b7bccd0dc1e055bd72de5d613e7b5c] of: Add plumbing 
for restricted DMA pool


[PATCH v2] Documentation: PCI: pci-error-recovery: swap sequence between MMIO Enabled and Link Reset

2021-07-01 Thread Wesley Sheng
The reset_link() callback (named reset_subordinates() in the pcie_do_recovery()
function) is called before mmio_enabled(), so swap the sequence of step 2
(MMIO Enabled) and step 3 (Link Reset) accordingly.

Signed-off-by: Wesley Sheng 
---
 Documentation/PCI/pci-error-recovery.rst | 25 
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/Documentation/PCI/pci-error-recovery.rst 
b/Documentation/PCI/pci-error-recovery.rst
index 187f43a03200..0e2f3f77bf0a 100644
--- a/Documentation/PCI/pci-error-recovery.rst
+++ b/Documentation/PCI/pci-error-recovery.rst
@@ -157,7 +157,7 @@ drivers.
 If all drivers on the segment/slot return PCI_ERS_RESULT_CAN_RECOVER,
 then the platform should re-enable IOs on the slot (or do nothing in
 particular, if the platform doesn't isolate slots), and recovery
-proceeds to STEP 2 (MMIO Enable).
+proceeds to STEP 3 (MMIO Enable).
 
 If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET),
 then recovery proceeds to STEP 4 (Slot Reset).
@@ -184,7 +184,14 @@ is STEP 6 (Permanent Failure).
and prints an error to syslog.  A reboot is then required to
get the device working again.
 
-STEP 2: MMIO Enabled
+STEP 2: Link Reset
+--
+The platform resets the link.  This is a PCI-Express specific step
+and is done whenever a fatal error has been detected that can be
+"solved" by resetting the link.
+
+
+STEP 3: MMIO Enabled
 
 The platform re-enables MMIO to the device (but typically not the
 DMA), and then calls the mmio_enabled() callback on all affected
@@ -197,8 +204,8 @@ information, if any, and eventually do things like trigger 
a device local
 reset or some such, but not restart operations. This callback is made if
 all drivers on a segment agree that they can try to recover and if no automatic
 link reset was performed by the HW. If the platform can't just re-enable IOs
-without a slot reset or a link reset, it will not call this callback, and
-instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
+without a slot reset, it will not call this callback, and
+instead will have gone directly to STEP 4 (Slot Reset)
 
 .. note::
 
@@ -210,7 +217,7 @@ instead will have gone directly to STEP 3 (Link Reset) or 
STEP 4 (Slot Reset)
such an error might cause IOs to be re-blocked for the whole
segment, and thus invalidate the recovery that other devices
on the same segment might have done, forcing the whole segment
-   into one of the next states, that is, link reset or slot reset.
+   into the next states, that is, slot reset.
 
 The driver should return one of the following result codes:
   - PCI_ERS_RESULT_RECOVERED
@@ -233,17 +240,11 @@ The driver should return one of the following result 
codes:
 
 The next step taken depends on the results returned by the drivers.
 If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
-proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
+proceeds to STEP 5 (Resume Operations).
 
 If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
 proceeds to STEP 4 (Slot Reset)
 
-STEP 3: Link Reset
---
-The platform resets the link.  This is a PCI-Express specific step
-and is done whenever a fatal error has been detected that can be
-"solved" by resetting the link.
-
 STEP 4: Slot Reset
 --
 
-- 
2.25.1
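
For context, the ordering inside pcie_do_recovery() that motivates this change looks roughly like 
the following (a heavily simplified sketch based on drivers/pci/pcie/err.c around v5.13, not 
verbatim):

	if (state == pci_channel_io_frozen) {
		/* Fatal error: the link is reset first ... */
		status = reset_subordinates(bridge);
		if (status != PCI_ERS_RESULT_RECOVERED)
			goto failed;
	}

	if (status == PCI_ERS_RESULT_CAN_RECOVER) {
		/* ... and mmio_enabled() is broadcast only afterwards. */
		pci_walk_bridge(bridge, report_mmio_enabled, &status);
	}

So in the code the link reset happens before the MMIO-enabled broadcast, which is what the 
documentation reordering above reflects.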



Re: [PATCH] Documentation: PCI: pci-error-recovery: rearrange the general sequence

2021-07-01 Thread Wesley Sheng
On Thu, Jul 01, 2021 at 05:22:31PM -0500, Bjorn Helgaas wrote:
> Please make the subject a little more specific.  "rearrange the
> general sequence" doesn't say anything about what was affected.
> 
> On Fri, Jun 18, 2021 at 02:04:46PM +0800, Wesley Sheng wrote:
> > Reset_link() callback function was called before mmio_enabled() in
> > pcie_do_recovery() function actually, so rearrange the general
> > sequence betwen step 2 and step 3 accordingly.
> 
> s/betwen/between/
> 
> Not sure "general" adds anything in this sentence.  "Step 2 and step
> 3" are not meaningful here in the commit log.  It needs to spell out
> what those steps are so the log makes sense by itself.
> 
> "reset_link" does not appear in pcie_do_recovery().  I'm guessing
> you're referring to the "reset_subordinates" function pointer?
>
Yes, you are right.
pcieaer-howto.rst has a section named "Provide callbacks"; there, the callback
supplied to pcie_do_recovery() is referred to as reset_link.
 
> > Signed-off-by: Wesley Sheng 
> 
> I didn't quite understand your response to Oliver, so I'll wait for
> your corrections and his ack before proceeding.
>
OK.
I thought step 2 (MMIO Enabled) and step 3 (Link Reset) should swap places.

> > ---
> >  Documentation/PCI/pci-error-recovery.rst | 23 ---
> >  1 file changed, 12 insertions(+), 11 deletions(-)
> > 
> > diff --git a/Documentation/PCI/pci-error-recovery.rst 
> > b/Documentation/PCI/pci-error-recovery.rst
> > index 187f43a03200..ac6a8729ef28 100644
> > --- a/Documentation/PCI/pci-error-recovery.rst
> > +++ b/Documentation/PCI/pci-error-recovery.rst
> > @@ -184,7 +184,14 @@ is STEP 6 (Permanent Failure).
> > and prints an error to syslog.  A reboot is then required to
> > get the device working again.
> >  
> > -STEP 2: MMIO Enabled
> > +STEP 2: Link Reset
> > +--
> > +The platform resets the link.  This is a PCI-Express specific step
> > +and is done whenever a fatal error has been detected that can be
> > +"solved" by resetting the link.
> > +
> > +
> > +STEP 3: MMIO Enabled
> >  
> >  The platform re-enables MMIO to the device (but typically not the
> >  DMA), and then calls the mmio_enabled() callback on all affected
> > @@ -197,8 +204,8 @@ information, if any, and eventually do things like 
> > trigger a device local
> >  reset or some such, but not restart operations. This callback is made if
> >  all drivers on a segment agree that they can try to recover and if no 
> > automatic
> >  link reset was performed by the HW. If the platform can't just re-enable 
> > IOs
> > -without a slot reset or a link reset, it will not call this callback, and
> > -instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot 
> > Reset)
> > +without a slot reset, it will not call this callback, and
> > +instead will have gone directly or STEP 4 (Slot Reset)
> 
> s/or/to/  ?
> 
> >  .. note::
> >  
> > @@ -210,7 +217,7 @@ instead will have gone directly to STEP 3 (Link Reset) 
> > or STEP 4 (Slot Reset)
> > such an error might cause IOs to be re-blocked for the whole
> > segment, and thus invalidate the recovery that other devices
> > on the same segment might have done, forcing the whole segment
> > -   into one of the next states, that is, link reset or slot reset.
> > +   into next states, that is, slot reset.
> 
> s/into next states/into the next state/ ?
> 
> >  The driver should return one of the following result codes:
> >- PCI_ERS_RESULT_RECOVERED
> > @@ -233,17 +240,11 @@ The driver should return one of the following result 
> > codes:
> >  
> >  The next step taken depends on the results returned by the drivers.
> >  If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
> > -proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
> > +proceeds to STEP 5 (Resume Operations).
> >  
> >  If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
> >  proceeds to STEP 4 (Slot Reset)
> >  
> > -STEP 3: Link Reset
> > ---
> > -The platform resets the link.  This is a PCI-Express specific step
> > -and is done whenever a fatal error has been detected that can be
> > -"solved" by resetting the link.
> > -
> >  STEP 4: Slot Reset
> >  --
> >  
> > -- 
> > 2.25.1
> > 


[powerpc:next] BUILD REGRESSION 4ebbbaa4ce8524b853dd6febf0176a6efa3482d7

2021-07-01 Thread kernel test robot
 randconfig-a005-20210630
i386 randconfig-a006-20210630
i386 randconfig-a015-20210701
i386 randconfig-a016-20210701
i386 randconfig-a011-20210701
i386 randconfig-a012-20210701
i386 randconfig-a013-20210701
i386 randconfig-a014-20210701
i386 randconfig-a014-20210630
i386 randconfig-a011-20210630
i386 randconfig-a016-20210630
i386 randconfig-a012-20210630
i386 randconfig-a013-20210630
i386 randconfig-a015-20210630
riscvnommu_k210_defconfig
riscvallyesconfig
riscvnommu_virt_defconfig
riscv allnoconfig
riscv   defconfig
riscv  rv32_defconfig
riscvallmodconfig
um   x86_64_defconfig
um i386_defconfig
umkunit_defconfig
x86_64   allyesconfig
x86_64rhel-8.3-kselftests
x86_64  defconfig
x86_64   rhel-8.3
x86_64  rhel-8.3-kbuiltin
x86_64  kexec

clang tested configs:
x86_64   randconfig-b001-20210630
x86_64   randconfig-b001-20210702
x86_64   randconfig-a004-20210702
x86_64   randconfig-a005-20210702
x86_64   randconfig-a002-20210702
x86_64   randconfig-a006-20210702
x86_64   randconfig-a003-20210702
x86_64   randconfig-a001-20210702
x86_64   randconfig-a012-20210630
x86_64   randconfig-a015-20210630
x86_64   randconfig-a016-20210630
x86_64   randconfig-a013-20210630
x86_64   randconfig-a011-20210630
x86_64   randconfig-a014-20210630

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


[powerpc:merge] BUILD SUCCESS e289c2e239c638cab7e71143e0a65c7c4a057ad7

2021-07-01 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
merge
branch HEAD: e289c2e239c638cab7e71143e0a65c7c4a057ad7  Automatic merge of 
'next' into merge (2021-07-01 23:06)

elapsed time: 729m

configs tested: 116
configs skipped: 3

The following configs have been built successfully.
More configs may be tested in the coming days.

gcc tested configs:
arm defconfig
arm64allyesconfig
arm64   defconfig
arm  allyesconfig
arm  allmodconfig
powerpcsam440ep_defconfig
powerpc powernv_defconfig
powerpc  acadia_defconfig
um  defconfig
sh  kfr2r09_defconfig
mips  bmips_stb_defconfig
powerpc  g5_defconfig
mips   bmips_be_defconfig
arm  ixp4xx_defconfig
armoxnas_v6_defconfig
arm axm55xx_defconfig
powerpc sequoia_defconfig
xtensa   common_defconfig
arm   spear13xx_defconfig
mips bigsur_defconfig
ia64generic_defconfig
arc nsimosci_hs_defconfig
xtensa  nommu_kc705_defconfig
mips loongson2k_defconfig
sh  sdk7780_defconfig
arm assabet_defconfig
m68kmvme147_defconfig
m68km5272c3_defconfig
powerpc mpc837x_rdb_defconfig
x86_64allnoconfig
ia64 allmodconfig
ia64defconfig
ia64 allyesconfig
m68k allmodconfig
m68kdefconfig
m68k allyesconfig
nios2   defconfig
arc  allyesconfig
nds32 allnoconfig
nds32   defconfig
nios2allyesconfig
cskydefconfig
alpha   defconfig
alphaallyesconfig
xtensa   allyesconfig
h8300allyesconfig
arc defconfig
sh   allmodconfig
parisc  defconfig
s390 allyesconfig
s390 allmodconfig
parisc   allyesconfig
s390defconfig
i386 allyesconfig
sparcallyesconfig
sparc   defconfig
i386defconfig
mips allyesconfig
mips allmodconfig
powerpc  allyesconfig
powerpc  allmodconfig
powerpc   allnoconfig
i386 randconfig-a004-20210630
i386 randconfig-a001-20210630
i386 randconfig-a003-20210630
i386 randconfig-a002-20210630
i386 randconfig-a005-20210630
i386 randconfig-a006-20210630
x86_64   randconfig-a002-20210630
x86_64   randconfig-a001-20210630
x86_64   randconfig-a004-20210630
x86_64   randconfig-a005-20210630
x86_64   randconfig-a006-20210630
x86_64   randconfig-a003-20210630
i386 randconfig-a014-20210630
i386 randconfig-a011-20210630
i386 randconfig-a016-20210630
i386 randconfig-a012-20210630
i386 randconfig-a013-20210630
i386 randconfig-a015-20210630
i386 randconfig-a015-20210701
i386 randconfig-a016-20210701
i386 randconfig-a011-20210701
i386 randconfig-a012-20210701
i386 randconfig-a013-20210701
i386 randconfig-a014-20210701
riscvnommu_k210_defconfig
riscvallyesconfig
riscvnommu_virt_defconfig
riscv allnoconfig
riscv   defconfig
riscv  rv32_defconfig
riscvallmodconfig
um   x86_64_defconfig
um i386_defconfig
umkunit_defconfig
x86_64   allyesconfig
x86_64rhel-8.3-kselftests
x86_64  defconfig
x86_64   rhel-8.3
x86_64

Re: [PATCH 1/2] powerpc/bpf: Fix detecting BPF atomic instructions

2021-07-01 Thread Alexei Starovoitov
On Thu, Jul 1, 2021 at 12:32 PM Naveen N. Rao
 wrote:
>
> Alexei Starovoitov wrote:
> > On Thu, Jul 1, 2021 at 8:09 AM Naveen N. Rao
> >  wrote:
> >>
> >> Commit 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other
> >> atomics in .imm") converted BPF_XADD to BPF_ATOMIC and added a way to
> >> distinguish instructions based on the immediate field. Existing JIT
> >> implementations were updated to check for the immediate field and to
> >> reject programs utilizing anything more than BPF_ADD (such as BPF_FETCH)
> >> in the immediate field.
> >>
> >> However, the check added to powerpc64 JIT did not look at the correct
> >> BPF instruction. Due to this, such programs would be accepted and
> >> incorrectly JIT'ed resulting in soft lockups, as seen with the atomic
> >> bounds test. Fix this by looking at the correct immediate value.
> >>
> >> Fixes: 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other 
> >> atomics in .imm")
> >> Reported-by: Jiri Olsa 
> >> Tested-by: Jiri Olsa 
> >> Signed-off-by: Naveen N. Rao 
> >> ---
> >> Hi Jiri,
> >> FYI: I made a small change in this patch -- using 'imm' directly, rather
> >> than insn[i].imm. I've still added your Tested-by since this shouldn't
> >> impact the fix in any way.
> >>
> >> - Naveen
> >
> > Excellent debugging! You guys are awesome.
>
> Thanks. Jiri and Brendan did the bulk of the work :)
>
> > How do you want this fix routed? via bpf tree?
>
> Michael has a few BPF patches queued up in powerpc tree for v5.14, so it
> might be easier to take these patches through the powerpc tree unless he
> feels otherwise. Michael?

Works for me. Thanks!


Re: [PATCH 1/2] powerpc/bpf: Fix detecting BPF atomic instructions

2021-07-01 Thread Naveen N. Rao

Alexei Starovoitov wrote:

On Thu, Jul 1, 2021 at 8:09 AM Naveen N. Rao
 wrote:


Commit 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other
atomics in .imm") converted BPF_XADD to BPF_ATOMIC and added a way to
distinguish instructions based on the immediate field. Existing JIT
implementations were updated to check for the immediate field and to
reject programs utilizing anything more than BPF_ADD (such as BPF_FETCH)
in the immediate field.

However, the check added to powerpc64 JIT did not look at the correct
BPF instruction. Due to this, such programs would be accepted and
incorrectly JIT'ed resulting in soft lockups, as seen with the atomic
bounds test. Fix this by looking at the correct immediate value.

Fixes: 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other atomics in 
.imm")
Reported-by: Jiri Olsa 
Tested-by: Jiri Olsa 
Signed-off-by: Naveen N. Rao 
---
Hi Jiri,
FYI: I made a small change in this patch -- using 'imm' directly, rather
than insn[i].imm. I've still added your Tested-by since this shouldn't
impact the fix in any way.

- Naveen


Excellent debugging! You guys are awesome.


Thanks. Jiri and Brendan did the bulk of the work :)


How do you want this fix routed? via bpf tree?


Michael has a few BPF patches queued up in powerpc tree for v5.14, so it 
might be easier to take these patches through the powerpc tree unless he 
feels otherwise. Michael?


This also needs to be tagged for stable:
Cc: sta...@vger.kernel.org # 5.12+


- Naveen
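
For reference, the kind of immediate-field check being discussed looks roughly like this (a 
simplified sketch of the pattern, not the exact powerpc JIT source; since commit 91c960b0056672 
the atomic operation is encoded in the instruction's imm field):

	case BPF_STX | BPF_ATOMIC | BPF_W:
	case BPF_STX | BPF_ATOMIC | BPF_DW:
		/* 'imm' must come from *this* instruction; reading the
		 * wrong insn's imm let unsupported ops such as
		 * BPF_ADD | BPF_FETCH slip through to the JIT. */
		if (imm != BPF_ADD) {
			pr_err_ratelimited("unsupported atomic op %02x\n", imm);
			return -EOPNOTSUPP;
		}
		break;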


Re: [PATCH 2/2] powerpc/bpf: Reject atomic ops in ppc32 JIT

2021-07-01 Thread Naveen N. Rao

Christophe Leroy wrote:



On 01/07/2021 at 17:08, Naveen N. Rao wrote:

Commit 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other
atomics in .imm") converted BPF_XADD to BPF_ATOMIC and updated all JIT
implementations to reject JIT'ing instructions with an immediate value
different from BPF_ADD. However, ppc32 BPF JIT was implemented around
the same time and didn't include the same change. Update the ppc32 JIT
accordingly.

Signed-off-by: Naveen N. Rao 


Shouldn't it also include a Fixes tag and stable Cc, as PPC32 eBPF was added in
5.13?


Yes, I wasn't sure which patch to actually blame. But you're right, this 
should have the below fixes tag since this affects the ppc32 eBPF JIT.




Fixes: 51c66ad849a7 ("powerpc/bpf: Implement extended BPF on PPC32")
Cc: sta...@vger.kernel.org


Cc: sta...@vger.kernel.org # v5.13


Thanks,
- Naveen



Re: [PATCH v2 3/4] powerpc: wii.dts: Expose the OTP on this platform

2021-07-01 Thread Emmanuel Gil Peyrot
On Sat, Jun 26, 2021 at 11:34:01PM +, Jonathan Neuschäfer wrote:
> On Wed, May 19, 2021 at 11:50:43AM +0200, Emmanuel Gil Peyrot wrote:
> > This can be used by the newly-added nintendo-otp nvmem module.
> > 
> > Signed-off-by: Emmanuel Gil Peyrot 
> > ---
> >  arch/powerpc/boot/dts/wii.dts | 5 +
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/arch/powerpc/boot/dts/wii.dts b/arch/powerpc/boot/dts/wii.dts
> > index aaa381da1906..7837c4a3f09c 100644
> > --- a/arch/powerpc/boot/dts/wii.dts
> > +++ b/arch/powerpc/boot/dts/wii.dts
> > @@ -219,6 +219,11 @@ control@d800100 {
> > reg = <0x0d800100 0x300>;
> > };
> >  
> > +   otp@d8001ec {
> > +   compatible = "nintendo,hollywood-otp";
> > +   reg = <0x0d8001ec 0x8>;
> 
> The OTP registers overlap with the previous node, control@d800100.
> Not sure what's the best way to structure the devicetree in this case,
> maybe something roughly like the following (untested, unverified):
[snip]

I couldn’t get this to work, but additionally it looks like it should
start 0x100 earlier and contain pic1@d800030 and gpio@d8000c0, given
https://wiibrew.org/wiki/Hardware/Hollywood_Registers

Would it make sense, for the time being, to reduce the size of this
control@d800100 device to the single register currently being used by
arch/powerpc/platforms/embedded6xx/wii.c (0xd800194, used to reboot the
system) and leave the refactor of restart + OTP + PIC + GPIO for a
future series?

Thanks,

-- 
Emmanuel Gil Peyrot




Re: [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload

2021-07-01 Thread Nicholas Piggin
Excerpts from Christopher M. Riedl's message of July 1, 2021 3:28 pm:
> On Wed Jun 30, 2021 at 11:15 PM CDT, Nicholas Piggin wrote:
>> Excerpts from Christopher M. Riedl's message of July 1, 2021 1:48 pm:
>> > On Sun Jun 20, 2021 at 10:13 PM CDT, Daniel Axtens wrote:
>> >> "Christopher M. Riedl"  writes:
>> >>
>> >> > Switching to a different mm with Hash translation causes SLB entries to
>> >> > be preloaded from the current thread_info. This reduces SLB faults, for
>> >> > example when threads share a common mm but operate on different address
>> >> > ranges.
>> >> >
>> >> > Preloading entries from the thread_info struct may not always be
>> >> > appropriate - such as when switching to a temporary mm. Introduce a new
>> >> > boolean in mm_context_t to skip the SLB preload entirely. Also move the
>> >> > SLB preload code into a separate function since switch_slb() is already
>> >> > quite long. The default behavior (preloading SLB entries from the
>> >> > current thread_info struct) remains unchanged.
>> >> >
>> >> > Signed-off-by: Christopher M. Riedl 
>> >> >
>> >> > ---
>> >> >
>> >> > v4:  * New to series.
>> >> > ---
>> >> >  arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
>> >> >  arch/powerpc/include/asm/mmu_context.h   | 13 ++
>> >> >  arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
>> >> >  arch/powerpc/mm/book3s64/slb.c   | 56 ++--
>> >> >  4 files changed, 50 insertions(+), 24 deletions(-)
>> >> >
>> >> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
>> >> > b/arch/powerpc/include/asm/book3s/64/mmu.h
>> >> > index eace8c3f7b0a1..b23a9dcdee5af 100644
>> >> > --- a/arch/powerpc/include/asm/book3s/64/mmu.h
>> >> > +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
>> >> > @@ -130,6 +130,9 @@ typedef struct {
>> >> > u32 pkey_allocation_map;
>> >> > s16 execute_only_pkey; /* key holding execute-only protection */
>> >> >  #endif
>> >> > +
>> >> > +   /* Do not preload SLB entries from thread_info during 
>> >> > switch_slb() */
>> >> > +   bool skip_slb_preload;
>> >> >  } mm_context_t;
>> >> >  
>> >> >  static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
>> >> > diff --git a/arch/powerpc/include/asm/mmu_context.h 
>> >> > b/arch/powerpc/include/asm/mmu_context.h
>> >> > index 4bc45d3ed8b0e..264787e90b1a1 100644
>> >> > --- a/arch/powerpc/include/asm/mmu_context.h
>> >> > +++ b/arch/powerpc/include/asm/mmu_context.h
>> >> > @@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct mm_struct 
>> >> > *oldmm,
>> >> > return 0;
>> >> >  }
>> >> >  
>> >> > +#ifdef CONFIG_PPC_BOOK3S_64
>> >> > +
>> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm)
>> >> > +{
>> >> > +   mm->context.skip_slb_preload = true;
>> >> > +}
>> >> > +
>> >> > +#else
>> >> > +
>> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
>> >> > +
>> >> > +#endif /* CONFIG_PPC_BOOK3S_64 */
>> >> > +
>> >> >  #include 
>> >> >  
>> >> >  #endif /* __KERNEL__ */
>> >> > diff --git a/arch/powerpc/mm/book3s64/mmu_context.c 
>> >> > b/arch/powerpc/mm/book3s64/mmu_context.c
>> >> > index c10fc8a72fb37..3479910264c59 100644
>> >> > --- a/arch/powerpc/mm/book3s64/mmu_context.c
>> >> > +++ b/arch/powerpc/mm/book3s64/mmu_context.c
>> >> > @@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, 
>> >> > struct mm_struct *mm)
> >> >> > atomic_set(&mm->context.active_cpus, 0);
> >> >> > atomic_set(&mm->context.copros, 0);
>> >> >  
>> >> > +   mm->context.skip_slb_preload = false;
>> >> > +
>> >> > return 0;
>> >> >  }
>> >> >  
>> >> > diff --git a/arch/powerpc/mm/book3s64/slb.c 
>> >> > b/arch/powerpc/mm/book3s64/slb.c
>> >> > index c91bd85eb90e3..da0836cb855af 100644
>> >> > --- a/arch/powerpc/mm/book3s64/slb.c
>> >> > +++ b/arch/powerpc/mm/book3s64/slb.c
>> >> > @@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int 
>> >> > index)
>> >> > asm volatile("slbie %0" : : "r" (slbie_data));
>> >> >  }
>> >> >  
>> >> > +static void preload_slb_entries(struct task_struct *tsk, struct 
>> >> > mm_struct *mm)
>> >> Should this be explicitly inline or even __always_inline? I'm thinking
>> >> switch_slb is probably a fairly hot path on hash?
>> > 
>> > Yes absolutely. I'll make this change in v5.
>> > 
>> >>
>> >> > +{
>> >> > +   struct thread_info *ti = task_thread_info(tsk);
>> >> > +   unsigned char i;
>> >> > +
>> >> > +   /*
>> >> > +* We gradually age out SLBs after a number of context switches 
>> >> > to
>> >> > +* reduce reload overhead of unused entries (like we do with 
>> >> > FP/VEC
>> >> > +* reload). Each time we wrap 256 switches, take an entry out 
>> >> > of the
>> >> > +* SLB preload cache.
>> >> > +*/
>> >> > +   tsk->thread.load_slb++;
>> >> > +   if (!tsk->thread.load_slb) {
>> >> > +   unsigned long pc = KSTK_EIP(tsk);
>> >> > +
>> >> > +   preload_age(ti);
>> >> > +   

Re: [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching

2021-07-01 Thread Christopher M. Riedl
On Thu Jul 1, 2021 at 1:12 AM CDT, Nicholas Piggin wrote:
> Excerpts from Christopher M. Riedl's message of May 6, 2021 2:34 pm:
> > When code patching a STRICT_KERNEL_RWX kernel the page containing the
> > address to be patched is temporarily mapped as writeable. Currently, a
> > per-cpu vmalloc patch area is used for this purpose. While the patch
> > area is per-cpu, the temporary page mapping is inserted into the kernel
> > page tables for the duration of patching. The mapping is exposed to CPUs
> > other than the patching CPU - this is undesirable from a hardening
> > perspective. Use a temporary mm instead which keeps the mapping local to
> > the CPU doing the patching.
> > 
> > Use the `poking_init` init hook to prepare a temporary mm and patching
> > address. Initialize the temporary mm by copying the init mm. Choose a
> > randomized patching address inside the temporary mm userspace address
> > space. The patching address is randomized between PAGE_SIZE and
> > DEFAULT_MAP_WINDOW-PAGE_SIZE. The upper limit is necessary due to how
> > the Book3s64 Hash MMU operates - by default the space above
> > DEFAULT_MAP_WINDOW is not available. For now, the patching address for
> > all platforms/MMUs is randomized inside this range.  The number of
> > possible random addresses is dependent on PAGE_SIZE and limited by
> > DEFAULT_MAP_WINDOW.
> > 
> > Bits of entropy with 64K page size on BOOK3S_64:
> > 
> > bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
> > 
> > PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
> > bits of entropy = log2(128TB / 64K) bits of entropy = 31
> > 
> > Randomization occurs only once during initialization at boot.
> > 
> > Introduce two new functions, map_patch() and unmap_patch(), to
> > respectively create and remove the temporary mapping with write
> > permissions at patching_addr. The Hash MMU on Book3s64 requires mapping
> > the page for patching with PAGE_SHARED since the kernel cannot access
> > userspace pages with the PAGE_PRIVILEGED (PAGE_KERNEL) bit set.
> > 
> > Also introduce hash_prefault_mapping() to preload the SLB entry and HPTE
> > for the patching_addr when using the Hash MMU on Book3s64 to avoid
> > taking an SLB and Hash fault during patching.
>
> What prevents the SLBE or HPTE from being removed before the last
> access?

This code runs with local IRQs disabled - we also don't access anything
else in userspace so I'm not sure what else could cause the entries to
be removed TBH.

>
>
> > +#ifdef CONFIG_PPC_BOOK3S_64
> > +
> > +static inline int hash_prefault_mapping(pgprot_t pgprot)
> >  {
> > -   struct vm_struct *area;
> > +   int err;
> >  
> > -   area = get_vm_area(PAGE_SIZE, VM_ALLOC);
> > -   if (!area) {
> > -   WARN_ONCE(1, "Failed to create text area for cpu %d\n",
> > -   cpu);
> > -   return -1;
> > -   }
> > -   this_cpu_write(text_poke_area, area);
> > +   if (radix_enabled())
> > +   return 0;
> >  
> > -   return 0;
> > -}
> > +   err = slb_allocate_user(patching_mm, patching_addr);
> > +   if (err)
> > +   pr_warn("map patch: failed to allocate slb entry\n");
> >  
> > -static int text_area_cpu_down(unsigned int cpu)
> > -{
> > -   free_vm_area(this_cpu_read(text_poke_area));
> > -   return 0;
> > +   err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0,
> > +  HPTE_USE_KERNEL_KEY);
> > +   if (err)
> > +   pr_warn("map patch: failed to insert hashed page\n");
> > +
> > +   /* See comment in switch_slb() in mm/book3s64/slb.c */
> > +   isync();
>
> I'm not sure if this is enough. Could we context switch here? You've
> got the PTL so no with a normal kernel but maybe yes with an RT kernel
> How about taking an machine check that clears the SLB? Could the HPTE
> get removed by something else here?

All of this happens after a local_irq_save() which should at least
prevent context switches IIUC. I am not sure what else could cause the
HPTE to get removed here.

>
> You want to prevent faults because you might be patching a fault
> handler?

In a more general sense: I don't think we want to take page faults every
time we patch an instruction with a STRICT_RWX kernel. The Hash MMU page
fault handler codepath also checks `current->mm` in some places which
won't match the temporary mm. Also `current->mm` can be NULL which
caused problems in my earlier revisions of this series.

>
> Thanks,
> Nick



[PATCH 2/2] powerpc/numa: Update cpu_cpu_map on CPU online/offline

2021-07-01 Thread Srikar Dronamraju
cpu_cpu_map holds all the CPUs in the DIE. However, on PowerPC this mask
doesn't get updated when CPUs are onlined/offlined; it is only updated when
CPUs are added/removed. So when both operations - onlining/offlining and
adding/removing of CPUs - are done simultaneously, the cpumasks end up
broken.

WARNING: CPU: 13 PID: 1142 at kernel/sched/topology.c:898
build_sched_domains+0xd48/0x1720
Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag
udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag
bonding tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
rfkill nf_tables nfnetlink pseries_rng xts vmx_crypto uio_pdrv_genirq
uio binfmt_misc ip_tables xfs libcrc32c dm_service_time sd_mod t10_pi sg
ibmvfc scsi_transport_fc ibmveth dm_multipath dm_mirror dm_region_hash
dm_log dm_mod fuse
CPU: 13 PID: 1142 Comm: kworker/13:2 Not tainted 5.13.0-rc6+ #28
Workqueue: events cpuset_hotplug_workfn
NIP:  c01caac8 LR: c01caac4 CTR: 007088ec
REGS: c0005596f220 TRAP: 0700   Not tainted  (5.13.0-rc6+)
MSR:  80029033   CR: 48828222  XER:
0009
CFAR: c01ea698 IRQMASK: 0
GPR00: c01caac4 c0005596f4c0 c1c4a400 0036
GPR04: fffd c0005596f1d0 0027 c018cfd07f90
GPR08: 0023 0001 0027 c018fe68ffe8
GPR12: 8000 c0001e9d1880 c0013a047200 0800
GPR16: c1d3c7d0 0240 0048 c00010aacd18
GPR20: 0001 c00010aacc18 c0013a047c00 c00139ec2400
GPR24: 0280 c00139ec2520 c00136c1b400 c1c93060
GPR28: c0013a047c20 c1d3c6c0 c1c978a0 000d
NIP [c01caac8] build_sched_domains+0xd48/0x1720
LR [c01caac4] build_sched_domains+0xd44/0x1720
Call Trace:
[c0005596f4c0] [c01caac4] build_sched_domains+0xd44/0x1720 
(unreliable)
[c0005596f670] [c01cc5ec] partition_sched_domains_locked+0x3ac/0x4b0
[c0005596f710] [c02804e4] rebuild_sched_domains_locked+0x404/0x9e0
[c0005596f810] [c0283e60] rebuild_sched_domains+0x40/0x70
[c0005596f840] [c0284124] cpuset_hotplug_workfn+0x294/0xf10
[c0005596fc60] [c0175040] process_one_work+0x290/0x590
[c0005596fd00] [c01753c8] worker_thread+0x88/0x620
[c0005596fda0] [c0181704] kthread+0x194/0x1a0
[c0005596fe10] [c000ccec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
485af049 6000 2fa30800 409e0028 80fe e89a00f8 e86100e8 38da0120
7f88e378 7ce53b78 4801fb91 6000 <0fe0> 3900 38e0 38c0

Fix this by updating cpu_cpu_map aka cpumask_of_node() on every CPU
online/offline.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/topology.h | 12 
 arch/powerpc/kernel/smp.c   |  3 +++
 arch/powerpc/mm/numa.c  |  7 ++-
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index e4db64c0e184..2f0a4d7b95f6 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -65,6 +65,11 @@ static inline int early_cpu_to_node(int cpu)
 
 int of_drconf_to_nid_single(struct drmem_lmb *lmb);
 
+extern void map_cpu_to_node(int cpu, int node);
+#ifdef CONFIG_HOTPLUG_CPU
+extern void unmap_cpu_from_node(unsigned long cpu);
+#endif /* CONFIG_HOTPLUG_CPU */
+
 #else
 
 static inline int early_cpu_to_node(int cpu) { return 0; }
@@ -93,6 +98,13 @@ static inline int of_drconf_to_nid_single(struct drmem_lmb 
*lmb)
return first_online_node;
 }
 
+#ifdef CONFIG_SMP
+static inline void map_cpu_to_node(int cpu, int node) {}
+#ifdef CONFIG_HOTPLUG_CPU
+static inline void unmap_cpu_from_node(unsigned long cpu) {}
+#endif /* CONFIG_HOTPLUG_CPU */
+#endif /* CONFIG_SMP */
+
 #endif /* CONFIG_NUMA */
 
 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6c6e4d934d86..e562cca13d66 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1407,6 +1407,8 @@ static void remove_cpu_from_masks(int cpu)
struct cpumask *(*mask_fn)(int) = cpu_sibling_mask;
int i;
 
+   unmap_cpu_from_node(cpu);
+
if (shared_caches)
mask_fn = cpu_l2_cache_mask;
 
@@ -1491,6 +1493,7 @@ static void add_cpu_to_masks(int cpu)
 * This CPU will not be in the online mask yet so we need to manually
 * add it to it's own thread sibling mask.
 

[PATCH 1/2] powerpc/numa: Print debug statements only when required

2021-07-01 Thread Srikar Dronamraju
Currently, a debug message gets printed on every attempt to add (or remove)
a CPU. However this is redundant if the CPU is already added to (or removed
from) the node.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/mm/numa.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 6d0d89127190..f68dbe4e982c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -141,10 +141,11 @@ static void map_cpu_to_node(int cpu, int node)
 {
update_numa_cpu_lookup_table(cpu, node);
 
-   dbg("adding cpu %d to node %d\n", cpu, node);
 
-   if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node])))
+   if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node]))) {
+   dbg("adding cpu %d to node %d\n", cpu, node);
cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
+   }
 }
 
 #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
@@ -152,13 +153,11 @@ static void unmap_cpu_from_node(unsigned long cpu)
 {
int node = numa_cpu_lookup_table[cpu];
 
-   dbg("removing cpu %lu from node %d\n", cpu, node);
-
if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
cpumask_clear_cpu(cpu, node_to_cpumask_map[node]);
+   dbg("removing cpu %lu from node %d\n", cpu, node);
} else {
-   printk(KERN_ERR "WARNING: cpu %lu not found in node %d\n",
-  cpu, node);
+   pr_err("WARNING: cpu %lu not found in node %d\n", cpu, node);
}
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
-- 
2.27.0



Re: [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching

2021-07-01 Thread Nicholas Piggin
Excerpts from Christopher M. Riedl's message of May 6, 2021 2:34 pm:
> When code patching a STRICT_KERNEL_RWX kernel the page containing the
> address to be patched is temporarily mapped as writeable. Currently, a
> per-cpu vmalloc patch area is used for this purpose. While the patch
> area is per-cpu, the temporary page mapping is inserted into the kernel
> page tables for the duration of patching. The mapping is exposed to CPUs
> other than the patching CPU - this is undesirable from a hardening
> perspective. Use a temporary mm instead which keeps the mapping local to
> the CPU doing the patching.
> 
> Use the `poking_init` init hook to prepare a temporary mm and patching
> address. Initialize the temporary mm by copying the init mm. Choose a
> randomized patching address inside the temporary mm userspace address
> space. The patching address is randomized between PAGE_SIZE and
> DEFAULT_MAP_WINDOW-PAGE_SIZE. The upper limit is necessary due to how
> the Book3s64 Hash MMU operates - by default the space above
> DEFAULT_MAP_WINDOW is not available. For now, the patching address for
> all platforms/MMUs is randomized inside this range.  The number of
> possible random addresses is dependent on PAGE_SIZE and limited by
> DEFAULT_MAP_WINDOW.
> 
> Bits of entropy with 64K page size on BOOK3S_64:
> 
> bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
> 
> PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
> bits of entropy = log2(128TB / 64K) bits of entropy = 31
> 
> Randomization occurs only once during initialization at boot.
> 
> Introduce two new functions, map_patch() and unmap_patch(), to
> respectively create and remove the temporary mapping with write
> permissions at patching_addr. The Hash MMU on Book3s64 requires mapping
> the page for patching with PAGE_SHARED since the kernel cannot access
> userspace pages with the PAGE_PRIVILEGED (PAGE_KERNEL) bit set.
> 
> Also introduce hash_prefault_mapping() to preload the SLB entry and HPTE
> for the patching_addr when using the Hash MMU on Book3s64 to avoid
> taking an SLB and Hash fault during patching.

What prevents the SLBE or HPTE from being removed before the last
access?


> +#ifdef CONFIG_PPC_BOOK3S_64
> +
> +static inline int hash_prefault_mapping(pgprot_t pgprot)
>  {
> - struct vm_struct *area;
> + int err;
>  
> - area = get_vm_area(PAGE_SIZE, VM_ALLOC);
> - if (!area) {
> - WARN_ONCE(1, "Failed to create text area for cpu %d\n",
> - cpu);
> - return -1;
> - }
> - this_cpu_write(text_poke_area, area);
> + if (radix_enabled())
> + return 0;
>  
> - return 0;
> -}
> + err = slb_allocate_user(patching_mm, patching_addr);
> + if (err)
> + pr_warn("map patch: failed to allocate slb entry\n");
>  
> -static int text_area_cpu_down(unsigned int cpu)
> -{
> - free_vm_area(this_cpu_read(text_poke_area));
> - return 0;
> + err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0,
> +HPTE_USE_KERNEL_KEY);
> + if (err)
> + pr_warn("map patch: failed to insert hashed page\n");
> +
> + /* See comment in switch_slb() in mm/book3s64/slb.c */
> + isync();

I'm not sure if this is enough. Could we context switch here? You've
got the PTL so no with a normal kernel but maybe yes with an RT kernel
How about taking an machine check that clears the SLB? Could the HPTE
get removed by something else here?

You want to prevent faults because you might be patching a fault 
handler?

Thanks,
Nick


Re: [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload

2021-07-01 Thread Christopher M. Riedl
On Thu Jul 1, 2021 at 1:04 AM CDT, Nicholas Piggin wrote:
> Excerpts from Christopher M. Riedl's message of July 1, 2021 3:28 pm:
> > On Wed Jun 30, 2021 at 11:15 PM CDT, Nicholas Piggin wrote:
> >> Excerpts from Christopher M. Riedl's message of July 1, 2021 1:48 pm:
> >> > On Sun Jun 20, 2021 at 10:13 PM CDT, Daniel Axtens wrote:
> >> >> "Christopher M. Riedl"  writes:
> >> >>
> >> >> > Switching to a different mm with Hash translation causes SLB entries 
> >> >> > to
> >> >> > be preloaded from the current thread_info. This reduces SLB faults, 
> >> >> > for
> >> >> > example when threads share a common mm but operate on different 
> >> >> > address
> >> >> > ranges.
> >> >> >
> >> >> > Preloading entries from the thread_info struct may not always be
> >> >> > appropriate - such as when switching to a temporary mm. Introduce a 
> >> >> > new
> >> >> > boolean in mm_context_t to skip the SLB preload entirely. Also move 
> >> >> > the
> >> >> > SLB preload code into a separate function since switch_slb() is 
> >> >> > already
> >> >> > quite long. The default behavior (preloading SLB entries from the
> >> >> > current thread_info struct) remains unchanged.
> >> >> >
> >> >> > Signed-off-by: Christopher M. Riedl 
> >> >> >
> >> >> > ---
> >> >> >
> >> >> > v4:  * New to series.
> >> >> > ---
> >> >> >  arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
> >> >> >  arch/powerpc/include/asm/mmu_context.h   | 13 ++
> >> >> >  arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
> >> >> >  arch/powerpc/mm/book3s64/slb.c   | 56 
> >> >> > ++--
> >> >> >  4 files changed, 50 insertions(+), 24 deletions(-)
> >> >> >
> >> >> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
> >> >> > b/arch/powerpc/include/asm/book3s/64/mmu.h
> >> >> > index eace8c3f7b0a1..b23a9dcdee5af 100644
> >> >> > --- a/arch/powerpc/include/asm/book3s/64/mmu.h
> >> >> > +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
> >> >> > @@ -130,6 +130,9 @@ typedef struct {
> >> >> >   u32 pkey_allocation_map;
> >> >> >   s16 execute_only_pkey; /* key holding execute-only protection */
> >> >> >  #endif
> >> >> > +
> >> >> > + /* Do not preload SLB entries from thread_info during 
> >> >> > switch_slb() */
> >> >> > + bool skip_slb_preload;
> >> >> >  } mm_context_t;
> >> >> >  
> >> >> >  static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
> >> >> > diff --git a/arch/powerpc/include/asm/mmu_context.h 
> >> >> > b/arch/powerpc/include/asm/mmu_context.h
> >> >> > index 4bc45d3ed8b0e..264787e90b1a1 100644
> >> >> > --- a/arch/powerpc/include/asm/mmu_context.h
> >> >> > +++ b/arch/powerpc/include/asm/mmu_context.h
> >> >> > @@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct mm_struct 
> >> >> > *oldmm,
> >> >> >   return 0;
> >> >> >  }
> >> >> >  
> >> >> > +#ifdef CONFIG_PPC_BOOK3S_64
> >> >> > +
> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm)
> >> >> > +{
> >> >> > + mm->context.skip_slb_preload = true;
> >> >> > +}
> >> >> > +
> >> >> > +#else
> >> >> > +
> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
> >> >> > +
> >> >> > +#endif /* CONFIG_PPC_BOOK3S_64 */
> >> >> > +
> >> >> >  #include 
> >> >> >  
> >> >> >  #endif /* __KERNEL__ */
> >> >> > diff --git a/arch/powerpc/mm/book3s64/mmu_context.c 
> >> >> > b/arch/powerpc/mm/book3s64/mmu_context.c
> >> >> > index c10fc8a72fb37..3479910264c59 100644
> >> >> > --- a/arch/powerpc/mm/book3s64/mmu_context.c
> >> >> > +++ b/arch/powerpc/mm/book3s64/mmu_context.c
> >> >> > @@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, 
> >> >> > struct mm_struct *mm)
> >> >> >   atomic_set(&mm->context.active_cpus, 0);
> >> >> >   atomic_set(&mm->context.copros, 0);
> >> >> >  
> >> >> > + mm->context.skip_slb_preload = false;
> >> >> > +
> >> >> >   return 0;
> >> >> >  }
> >> >> >  
> >> >> > diff --git a/arch/powerpc/mm/book3s64/slb.c 
> >> >> > b/arch/powerpc/mm/book3s64/slb.c
> >> >> > index c91bd85eb90e3..da0836cb855af 100644
> >> >> > --- a/arch/powerpc/mm/book3s64/slb.c
> >> >> > +++ b/arch/powerpc/mm/book3s64/slb.c
> >> >> > @@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int 
> >> >> > index)
> >> >> >   asm volatile("slbie %0" : : "r" (slbie_data));
> >> >> >  }
> >> >> >  
> >> >> > +static void preload_slb_entries(struct task_struct *tsk, struct 
> >> >> > mm_struct *mm)
> >> >> Should this be explicitly inline or even __always_inline? I'm thinking
> >> >> switch_slb is probably a fairly hot path on hash?
> >> > 
> >> > Yes absolutely. I'll make this change in v5.
> >> > 
> >> >>
> >> >> > +{
> >> >> > + struct thread_info *ti = task_thread_info(tsk);
> >> >> > + unsigned char i;
> >> >> > +
> >> >> > + /*
> >> >> > +  * We gradually age out SLBs after a number of context switches 
> >> >> > to
> >> >> > +  * reduce reload overhead of unused entries (like we do with 
> >> >> > FP/VEC
> >> >> > +  * reload). Each time we wrap 

Re: [PATCH v15 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-07-01 Thread Will Deacon
On Wed, Jun 30, 2021 at 08:56:51AM -0700, Nathan Chancellor wrote:
> On Wed, Jun 30, 2021 at 12:43:48PM +0100, Will Deacon wrote:
> > On Wed, Jun 30, 2021 at 05:17:27PM +0800, Claire Chang wrote:
> > > `BUG: unable to handle page fault for address: 003a8290` and
> > > the fact it crashed at `_raw_spin_lock_irqsave` look like the memory
> > > (maybe dev->dma_io_tlb_mem) was corrupted?
> > > The dev->dma_io_tlb_mem should be set here
> > > (https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/drivers/pci/probe.c#n2528)
> > > through device_initialize.
> > 
> > I'm less sure about this. 'dma_io_tlb_mem' should be pointing at
> > 'io_tlb_default_mem', which is a page-aligned allocation from memblock.
> > The spinlock is at offset 0x24 in that structure, and looking at the
> > register dump from the crash:
> > 
> > Jun 29 18:28:42 hp-4300G kernel: RSP: 0018:adb4013db9e8 EFLAGS: 00010006
> > Jun 29 18:28:42 hp-4300G kernel: RAX: 003a8290 RBX: 
> >  RCX: 8900572ad580
> > Jun 29 18:28:42 hp-4300G kernel: RDX: 89005653f024 RSI: 
> > 000c RDI: 1d17
> > Jun 29 18:28:42 hp-4300G kernel: RBP: 0a20d000 R08: 
> > 000c R09: 
> > Jun 29 18:28:42 hp-4300G kernel: R10: 0a20d000 R11: 
> > 89005653f000 R12: 0212
> > Jun 29 18:28:42 hp-4300G kernel: R13: 1000 R14: 
> > 0002 R15: 0020
> > Jun 29 18:28:42 hp-4300G kernel: FS:  7f1f8898ea40() 
> > GS:89005728() knlGS:
> > Jun 29 18:28:42 hp-4300G kernel: CS:  0010 DS:  ES:  CR0: 
> > 80050033
> > Jun 29 18:28:42 hp-4300G kernel: CR2: 003a8290 CR3: 
> > 0001020d CR4: 00350ee0
> > Jun 29 18:28:42 hp-4300G kernel: Call Trace:
> > Jun 29 18:28:42 hp-4300G kernel:  _raw_spin_lock_irqsave+0x39/0x50
> > Jun 29 18:28:42 hp-4300G kernel:  swiotlb_tbl_map_single+0x12b/0x4c0
> > 
> > Then that correlates with R11 holding the 'dma_io_tlb_mem' pointer and
> > RDX pointing at the spinlock. Yet RAX is holding junk :/
> > 
> > I agree that enabling KASAN would be a good idea, but I also think we
> > probably need to get some more information out of swiotlb_tbl_map_single()
> > to see see what exactly is going wrong in there.
> 
> I can certainly enable KASAN and if there is any debug print I can add
> or dump anything, let me know!

I bit the bullet and took v5.13 with swiotlb/for-linus-5.14 merged in, built
x86 defconfig and ran it on my laptop. However, it seems to work fine!

Please can you share your .config?

Will


Re: [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload

2021-07-01 Thread Nicholas Piggin
Excerpts from Christopher M. Riedl's message of July 1, 2021 4:53 pm:
> On Thu Jul 1, 2021 at 1:04 AM CDT, Nicholas Piggin wrote:
>> Excerpts from Christopher M. Riedl's message of July 1, 2021 3:28 pm:
>> > On Wed Jun 30, 2021 at 11:15 PM CDT, Nicholas Piggin wrote:
>> >> Excerpts from Christopher M. Riedl's message of July 1, 2021 1:48 pm:
>> >> > On Sun Jun 20, 2021 at 10:13 PM CDT, Daniel Axtens wrote:
>> >> >> "Christopher M. Riedl"  writes:
>> >> >>
>> >> >> > Switching to a different mm with Hash translation causes SLB entries 
>> >> >> > to
>> >> >> > be preloaded from the current thread_info. This reduces SLB faults, 
>> >> >> > for
>> >> >> > example when threads share a common mm but operate on different 
>> >> >> > address
>> >> >> > ranges.
>> >> >> >
>> >> >> > Preloading entries from the thread_info struct may not always be
>> >> >> > appropriate - such as when switching to a temporary mm. Introduce a 
>> >> >> > new
>> >> >> > boolean in mm_context_t to skip the SLB preload entirely. Also move 
>> >> >> > the
>> >> >> > SLB preload code into a separate function since switch_slb() is 
>> >> >> > already
>> >> >> > quite long. The default behavior (preloading SLB entries from the
>> >> >> > current thread_info struct) remains unchanged.
>> >> >> >
>> >> >> > Signed-off-by: Christopher M. Riedl 
>> >> >> >
>> >> >> > ---
>> >> >> >
>> >> >> > v4:  * New to series.
>> >> >> > ---
>> >> >> >  arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
>> >> >> >  arch/powerpc/include/asm/mmu_context.h   | 13 ++
>> >> >> >  arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
>> >> >> >  arch/powerpc/mm/book3s64/slb.c   | 56 
>> >> >> > ++--
>> >> >> >  4 files changed, 50 insertions(+), 24 deletions(-)
>> >> >> >
>> >> >> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
>> >> >> > b/arch/powerpc/include/asm/book3s/64/mmu.h
>> >> >> > index eace8c3f7b0a1..b23a9dcdee5af 100644
>> >> >> > --- a/arch/powerpc/include/asm/book3s/64/mmu.h
>> >> >> > +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
>> >> >> > @@ -130,6 +130,9 @@ typedef struct {
>> >> >> >  u32 pkey_allocation_map;
>> >> >> >  s16 execute_only_pkey; /* key holding execute-only protection */
>> >> >> >  #endif
>> >> >> > +
>> >> >> > +/* Do not preload SLB entries from thread_info during 
>> >> >> > switch_slb() */
>> >> >> > +bool skip_slb_preload;
>> >> >> >  } mm_context_t;
>> >> >> >  
>> >> >> >  static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
>> >> >> > diff --git a/arch/powerpc/include/asm/mmu_context.h 
>> >> >> > b/arch/powerpc/include/asm/mmu_context.h
>> >> >> > index 4bc45d3ed8b0e..264787e90b1a1 100644
>> >> >> > --- a/arch/powerpc/include/asm/mmu_context.h
>> >> >> > +++ b/arch/powerpc/include/asm/mmu_context.h
>> >> >> > @@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct 
>> >> >> > mm_struct *oldmm,
>> >> >> >  return 0;
>> >> >> >  }
>> >> >> >  
>> >> >> > +#ifdef CONFIG_PPC_BOOK3S_64
>> >> >> > +
>> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm)
>> >> >> > +{
>> >> >> > +mm->context.skip_slb_preload = true;
>> >> >> > +}
>> >> >> > +
>> >> >> > +#else
>> >> >> > +
>> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
>> >> >> > +
>> >> >> > +#endif /* CONFIG_PPC_BOOK3S_64 */
>> >> >> > +
>> >> >> >  #include 
>> >> >> >  
>> >> >> >  #endif /* __KERNEL__ */
>> >> >> > diff --git a/arch/powerpc/mm/book3s64/mmu_context.c 
>> >> >> > b/arch/powerpc/mm/book3s64/mmu_context.c
>> >> >> > index c10fc8a72fb37..3479910264c59 100644
>> >> >> > --- a/arch/powerpc/mm/book3s64/mmu_context.c
>> >> >> > +++ b/arch/powerpc/mm/book3s64/mmu_context.c
>> >> >> > @@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, 
>> >> >> > struct mm_struct *mm)
>> >> >> >  atomic_set(&mm->context.active_cpus, 0);
>> >> >> >  atomic_set(&mm->context.copros, 0);
>> >> >> >  
>> >> >> > +mm->context.skip_slb_preload = false;
>> >> >> > +
>> >> >> >  return 0;
>> >> >> >  }
>> >> >> >  
>> >> >> > diff --git a/arch/powerpc/mm/book3s64/slb.c 
>> >> >> > b/arch/powerpc/mm/book3s64/slb.c
>> >> >> > index c91bd85eb90e3..da0836cb855af 100644
>> >> >> > --- a/arch/powerpc/mm/book3s64/slb.c
>> >> >> > +++ b/arch/powerpc/mm/book3s64/slb.c
>> >> >> > @@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int 
>> >> >> > index)
>> >> >> >  asm volatile("slbie %0" : : "r" (slbie_data));
>> >> >> >  }
>> >> >> >  
>> >> >> > +static void preload_slb_entries(struct task_struct *tsk, struct 
>> >> >> > mm_struct *mm)
>> >> >> Should this be explicitly inline or even __always_inline? I'm thinking
>> >> >> switch_slb is probably a fairly hot path on hash?
>> >> > 
>> >> > Yes absolutely. I'll make this change in v5.
>> >> > 
>> >> >>
>> >> >> > +{
>> >> >> > +struct thread_info *ti = task_thread_info(tsk);
>> >> >> > +unsigned char i;
>> >> >> > +
>> >> >> > +/*
>> >> >> > + * We gradually age out SLBs after a number 

Re: [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching

2021-07-01 Thread Nicholas Piggin
Excerpts from Christopher M. Riedl's message of July 1, 2021 5:02 pm:
> On Thu Jul 1, 2021 at 1:12 AM CDT, Nicholas Piggin wrote:
>> Excerpts from Christopher M. Riedl's message of May 6, 2021 2:34 pm:
>> > When code patching a STRICT_KERNEL_RWX kernel the page containing the
>> > address to be patched is temporarily mapped as writeable. Currently, a
>> > per-cpu vmalloc patch area is used for this purpose. While the patch
>> > area is per-cpu, the temporary page mapping is inserted into the kernel
>> > page tables for the duration of patching. The mapping is exposed to CPUs
>> > other than the patching CPU - this is undesirable from a hardening
>> > perspective. Use a temporary mm instead which keeps the mapping local to
>> > the CPU doing the patching.
>> > 
>> > Use the `poking_init` init hook to prepare a temporary mm and patching
>> > address. Initialize the temporary mm by copying the init mm. Choose a
>> > randomized patching address inside the temporary mm userspace address
>> > space. The patching address is randomized between PAGE_SIZE and
>> > DEFAULT_MAP_WINDOW-PAGE_SIZE. The upper limit is necessary due to how
>> > the Book3s64 Hash MMU operates - by default the space above
>> > DEFAULT_MAP_WINDOW is not available. For now, the patching address for
>> > all platforms/MMUs is randomized inside this range.  The number of
>> > possible random addresses is dependent on PAGE_SIZE and limited by
>> > DEFAULT_MAP_WINDOW.
>> > 
>> > Bits of entropy with 64K page size on BOOK3S_64:
>> > 
>> > bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
>> > 
>> > PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
>> > bits of entropy = log2(128TB / 64K) bits of entropy = 31
>> > 
>> > Randomization occurs only once during initialization at boot.
>> > 
>> > Introduce two new functions, map_patch() and unmap_patch(), to
>> > respectively create and remove the temporary mapping with write
>> > permissions at patching_addr. The Hash MMU on Book3s64 requires mapping
>> > the page for patching with PAGE_SHARED since the kernel cannot access
>> > userspace pages with the PAGE_PRIVILEGED (PAGE_KERNEL) bit set.
>> > 
>> > Also introduce hash_prefault_mapping() to preload the SLB entry and HPTE
>> > for the patching_addr when using the Hash MMU on Book3s64 to avoid
>> > taking an SLB and Hash fault during patching.
>>
>> What prevents the SLBE or HPTE from being removed before the last
>> access?
> 
> This code runs with local IRQs disabled - we also don't access anything
> else in userspace so I'm not sure what else could cause the entries to
> be removed TBH.
> 
>>
>>
>> > +#ifdef CONFIG_PPC_BOOK3S_64
>> > +
>> > +static inline int hash_prefault_mapping(pgprot_t pgprot)
>> >  {
>> > -  struct vm_struct *area;
>> > +  int err;
>> >  
>> > -  area = get_vm_area(PAGE_SIZE, VM_ALLOC);
>> > -  if (!area) {
>> > -  WARN_ONCE(1, "Failed to create text area for cpu %d\n",
>> > -  cpu);
>> > -  return -1;
>> > -  }
>> > -  this_cpu_write(text_poke_area, area);
>> > +  if (radix_enabled())
>> > +  return 0;
>> >  
>> > -  return 0;
>> > -}
>> > +  err = slb_allocate_user(patching_mm, patching_addr);
>> > +  if (err)
>> > +  pr_warn("map patch: failed to allocate slb entry\n");
>> >  
>> > -static int text_area_cpu_down(unsigned int cpu)
>> > -{
>> > -  free_vm_area(this_cpu_read(text_poke_area));
>> > -  return 0;
>> > +  err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0,
>> > + HPTE_USE_KERNEL_KEY);
>> > +  if (err)
>> > +  pr_warn("map patch: failed to insert hashed page\n");
>> > +
>> > +  /* See comment in switch_slb() in mm/book3s64/slb.c */
>> > +  isync();
>>
>> I'm not sure if this is enough. Could we context switch here? You've
>> got the PTL so no with a normal kernel, but maybe yes with an RT kernel.
>> How about taking a machine check that clears the SLB? Could the HPTE
>> get removed by something else here?
> 
> All of this happens after a local_irq_save() which should at least
> prevent context switches IIUC.

Ah yeah I didn't look that far back. A machine check can take out SLB
entries.

> I am not sure what else could cause the
> HPTE to get removed here.

Other CPUs?

>> You want to prevent faults because you might be patching a fault
>> handler?
> 
> In a more general sense: I don't think we want to take page faults every
> time we patch an instruction with a STRICT_RWX kernel. The Hash MMU page
> fault handler codepath also checks `current->mm` in some places which
> won't match the temporary mm. Also `current->mm` can be NULL which
> caused problems in my earlier revisions of this series.

Hmm, that's a bit of a hack then. Maybe doing an actual mm switch and 
setting current->mm properly would explode too much. Maybe that's okayish.
But I can't see how the HPT code is up to the job of this in general 
(even if that current->mm issue was fixed).

To do it without holes you would either 

Re: [PATCH v5] pseries: prevent free CPU ids to be reused on another node

2021-07-01 Thread Laurent Dufour

Hi Michael,

Do you mind taking this patch for 5.14?

Thanks,
Laurent.

Le 29/04/2021 à 19:49, Laurent Dufour a écrit :

When a CPU is hot added, the CPU ids are taken from the available mask,
starting with the lowest possible set. If those values were previously used
for CPUs attached to a different node, it looks to applications as if these
CPUs had migrated from one node to another, which is not expected in real
life.

To prevent this, record the CPU ids used for each node and do not reuse
them on another node. However, to prevent CPU hot plug from failing when a
node runs out of CPU ids, the capability to reuse other nodes' free CPU
ids is kept. A warning is displayed in such a case to alert the user.

A new CPU bit mask (node_recorded_ids_map) is introduced for each possible
node. It is populated with the CPUs onlined at boot time, and then when a
CPU is hot plugged to a node. The bits in that mask remain set when the CPU
is hot unplugged, to record that these CPU ids have been used for this node.

If no id set was found, a retry is made without removing the ids used on
the other nodes to try reusing them. This is the way ids have been
allocated prior to this patch.
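
The allocation thus becomes a two-pass search; roughly (a sketch of the
logic described above, not the patch itself):

	/* First pass: only ids never recorded on another node. */
	rc = find_cpu_id_range(nthreads, assigned_node, &cpu_mask);
	if (rc && nr_node_ids > 1) {
		/* Fallback: allow ids already recorded on other nodes. */
		rc = find_cpu_id_range(nthreads, NUMA_NO_NODE, &cpu_mask);
		if (!rc)
			pr_warn("CPU ids exhausted on node %d, reusing other nodes' free ids\n",
				assigned_node);
	}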

The effect of this patch can be seen by removing and adding CPUs using the
Qemu monitor. In the following case, the first CPU from the node 2 is
removed, then the first one from the node 1 is removed too. Later, the
first CPU of the node 2 is added back. Without that patch, the kernel will
number these CPUs using the first available CPU ids, which are the ones
freed when removing the second CPU of the node 0. This leads the CPU ids
16-23 to move from the node 1 to the node 2. With the patch applied, the
CPU ids 32-39 are used since they are the lowest free ones which have not
been used on another node.

At boot time:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Vanilla kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 16 17 18 19 20 21 22 23 40 41 42 43 44 45 46 47

Patched kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Signed-off-by: Laurent Dufour 
---
V5:
  - Rework code structure
  - Reintroduce the capability to reuse other node's ids.
V4: addressing Nathan's comment
  - Rename the local variable named 'nid' into 'assigned_node'
V3: addressing Nathan's comments
  - Remove the retry feature
  - Reduce the number of local variables (removing 'i')
  - Add comment about the cpu_add_remove_lock protecting the added CPU mask.
  V2: (no functional changes)
  - update the test's output in the commit's description
  - node_recorded_ids_map should be static
---
  arch/powerpc/platforms/pseries/hotplug-cpu.c | 171 ++-
  1 file changed, 132 insertions(+), 39 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 7e970f81d8ff..e1f224320102 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -39,6 +39,12 @@
  /* This version can't take the spinlock, because it never returns */
  static int rtas_stop_self_token = RTAS_UNKNOWN_SERVICE;
  
+/*

+ * Record the CPU ids used on each node.
+ * Protected by cpu_add_remove_lock.
+ */
+static cpumask_var_t node_recorded_ids_map[MAX_NUMNODES];
+
  static void rtas_stop_self(void)
  {
static struct rtas_args args;
@@ -139,72 +145,148 @@ static void pseries_cpu_die(unsigned int cpu)
paca_ptrs[cpu]->cpu_start = 0;
  }
  
+/**

+ * find_cpu_id_range - find a linear range of @nthreads free CPU ids.
+ * @nthreads : the number of threads (cpu ids)
+ * @assigned_node : the node it belongs to or NUMA_NO_NODE if free ids from any
+ *  node can be picked.
+ * @cpu_mask: the returned CPU mask.
+ *
+ * Returns 0 on success.
+ */
+static int find_cpu_id_range(unsigned int nthreads, int assigned_node,
+cpumask_var_t *cpu_mask)
+{
+   cpumask_var_t candidate_mask;
+   unsigned int cpu, node;
+   int rc = -ENOSPC;
+
+   if (!zalloc_cpumask_var(&candidate_mask, GFP_KERNEL))
+   return -ENOMEM;
+
+   cpumask_clear(*cpu_mask);
+   for (cpu = 0; cpu < nthreads; cpu++)
+   cpumask_set_cpu(cpu, *cpu_mask);
+
+   BUG_ON(!cpumask_subset(cpu_present_mask, cpu_possible_mask));
+
+   /* Get a bitmap of unoccupied slots. */
+   cpumask_xor(candidate_mask, cpu_possible_mask, cpu_present_mask);
+
+   if (assigned_node != NUMA_NO_NODE) {
+   /*
+* Remove 

Re: [PATCH v2] ppc64/numa: consider the max numa node for migratable LPAR

2021-07-01 Thread Laurent Dufour

Hi Michael,

Do you mind taking this patch for 5.14?

Thanks,
Laurent.

Le 11/05/2021 à 09:31, Laurent Dufour a écrit :

When a LPAR is migratable, we should consider the maximum possible NUMA
node instead of the number of NUMA nodes on the actual system.

The DT property 'ibm,current-associativity-domains' defines the maximum
number of nodes the LPAR can see when running on that box. But if the LPAR
is migrated to another box, it may see up to the number of nodes defined by
'ibm,max-associativity-domains'. So if a LPAR is migratable, that value
should be used.

Unfortunately, there is no easy way to know if a LPAR is migratable or
not. The hypervisor exports the property 'ibm,migratable-partition' when
it is set up to migrate partitions, but that does not mean that the
current partition is migratable.

Without this patch, when a LPAR is started on a 2-node box and then
migrated to a 3-node box, the hypervisor may spread the LPAR's CPUs on the
3rd node. In that case if a CPU from that 3rd node is added to the LPAR, it
will be wrongly assigned to the node because the kernel has been set to use
up to 2 nodes (the configuration of the departure node). With this patch
applied, the CPU is correctly added to the 3rd node.

Fixes: f9f130ff2ec9 ("powerpc/numa: Detect support for coregroup")
Reviewed-by: Srikar Dronamraju 
Signed-off-by: Laurent Dufour 
---
V2: Address Srikar's comments
  - Fix the commit message
  - Use pr_info instead printk(KERN_INFO..)
---
  arch/powerpc/mm/numa.c | 13 ++---
  1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..094a1076fd1f 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -893,7 +893,7 @@ static void __init setup_node_data(int nid, u64 start_pfn, 
u64 end_pfn)
  static void __init find_possible_nodes(void)
  {
struct device_node *rtas;
-   const __be32 *domains;
+   const __be32 *domains = NULL;
int prop_length, max_nodes;
u32 i;
  
@@ -909,9 +909,14 @@ static void __init find_possible_nodes(void)

 * it doesn't exist, then fallback on ibm,max-associativity-domains.
 * Current denotes what the platform can support compared to max
 * which denotes what the Hypervisor can support.
+*
+* If the LPAR is migratable, new nodes might be activated after a LPM,
+* so we should consider the max number in that case.
 */
-   domains = of_get_property(rtas, "ibm,current-associativity-domains",
-   &prop_length);
+   if (!of_get_property(of_root, "ibm,migratable-partition", NULL))
+   domains = of_get_property(rtas,
+ "ibm,current-associativity-domains",
+ &prop_length);
if (!domains) {
domains = of_get_property(rtas, "ibm,max-associativity-domains",
&prop_length);
@@ -920,6 +925,8 @@ static void __init find_possible_nodes(void)
}
  
max_nodes = of_read_number(&domains[min_common_depth], 1);

+   pr_info("Partition configured for %d NUMA nodes.\n", max_nodes);
+
for (i = 0; i < max_nodes; i++) {
if (!node_possible(i))
node_set(i, node_possible_map);





Re: [PATCH v5] pseries/drmem: update LMBs after LPM

2021-07-01 Thread Laurent Dufour

Hi Michael,

Do you mind taking this patch for 5.14?

Thanks,
Laurent.

Le 17/05/2021 à 11:06, Laurent Dufour a écrit :

After a LPM, the device tree node ibm,dynamic-reconfiguration-memory may be
updated by the hypervisor when the NUMA topology of the LPAR's
memory is updated.

This is handled by the kernel, but the memory's node is not updated because
there is no way to move a memory block between nodes from the Linux kernel
point of view.

If later a memory block is added or removed, drmem_update_dt() is called
and it overwrites the DT node ibm,dynamic-reconfiguration-memory to
match the added or removed LMB. But the LMB's associativity node has not
been updated after the DT node update, and thus the node is overwritten by
Linux's topology instead of the hypervisor's.

Introduce a hook called when the ibm,dynamic-reconfiguration-memory node is
updated to force an update of the LMB's associativity. However, ignore the
call to that hook when the update has been triggered by drmem_update_dt().
Because, in that case, the LMB tree has been used to set the DT property
and thus it doesn't need to be updated back. Since drmem_update_dt() is
called under the protection of the device_hotplug_lock and the hook is
called in the same context, use a simple boolean variable to detect that
call.

Cc: Nathan Lynch 
Cc: Aneesh Kumar K.V 
Cc: Tyrel Datwyler 
Signed-off-by: Laurent Dufour 
---

V5:
  - Reword the commit's description to address Nathan's comments.
V4:
  - Prevent the LMB to be updated back in the case the request came from the
  LMB tree's update.
V3:
  - Check rd->dn->name instead of rd->dn->full_name
V2:
  - Take Tyrel's idea to rely on OF_RECONFIG_UPDATE_PROPERTY instead of
  introducing a new hook mechanism.
---
  arch/powerpc/include/asm/drmem.h  |  1 +
  arch/powerpc/mm/drmem.c   | 46 +++
  .../platforms/pseries/hotplug-memory.c|  4 ++
  3 files changed, 51 insertions(+)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index bf2402fed3e0..4265d5e95c2c 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -111,6 +111,7 @@ int drmem_update_dt(void);
  int __init
  walk_drmem_lmbs_early(unsigned long node, void *data,
  int (*func)(struct drmem_lmb *, const __be32 **, void *));
+void drmem_update_lmbs(struct property *prop);
  #endif
  
  static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb)

diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 9af3832c9d8d..22197b18d85e 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -18,6 +18,7 @@ static int n_root_addr_cells, n_root_size_cells;
  
  static struct drmem_lmb_info __drmem_info;

  struct drmem_lmb_info *drmem_info = &__drmem_info;
+static bool in_drmem_update;
  
  u64 drmem_lmb_memory_max(void)

  {
@@ -178,6 +179,11 @@ int drmem_update_dt(void)
if (!memory)
return -1;
  
+	/*

+* Set in_drmem_update to prevent the notifier callback from processing
+* the DT property again, since the change is coming from the LMB tree.
+*/
+   in_drmem_update = true;
prop = of_find_property(memory, "ibm,dynamic-memory", NULL);
if (prop) {
rc = drmem_update_dt_v1(memory, prop);
@@ -186,6 +192,7 @@ int drmem_update_dt(void)
if (prop)
rc = drmem_update_dt_v2(memory, prop);
}
+   in_drmem_update = false;
  
  	of_node_put(memory);

return rc;
@@ -307,6 +314,45 @@ int __init walk_drmem_lmbs_early(unsigned long node, void 
*data,
return ret;
  }
  
+/*

+ * Update the LMB associativity index.
+ */
+static int update_lmb(struct drmem_lmb *updated_lmb,
+ __maybe_unused const __be32 **usm,
+ __maybe_unused void *data)
+{
+   struct drmem_lmb *lmb;
+
+   for_each_drmem_lmb(lmb) {
+   if (lmb->drc_index != updated_lmb->drc_index)
+   continue;
+
+   lmb->aa_index = updated_lmb->aa_index;
+   break;
+   }
+   return 0;
+}
+
+/*
+ * Update the LMB associativity index.
+ *
+ * This needs to be called when the hypervisor is updating the
+ * dynamic-reconfiguration-memory node property.
+ */
+void drmem_update_lmbs(struct property *prop)
+{
+   /*
+* Don't update the LMBs if triggered by the update done in
+* drmem_update_dt(), the LMB values have been used to update the DT
+* property in that case.
+*/
+   if (in_drmem_update)
+   return;
+   if (!strcmp(prop->name, "ibm,dynamic-memory"))
+   __walk_drmem_v1_lmbs(prop->value, NULL, NULL, update_lmb);
+   else if (!strcmp(prop->name, "ibm,dynamic-memory-v2"))
+   __walk_drmem_v2_lmbs(prop->value, NULL, NULL, update_lmb);
+}
  #endif
  
  static int init_drmem_lmb_size(struct 

Re: [RFC PATCH 38/43] KVM: PPC: Book3S HV P9: Test dawr_enabled() before saving host DAWR SPRs

2021-07-01 Thread Nicholas Piggin
Excerpts from Fabiano Rosas's message of July 1, 2021 3:51 am:
> Nicholas Piggin  writes:
> 
>> Some of the DAWR SPR access is already predicated on dawr_enabled(),
>> apply this to the remainder of the accesses.
>>
>> Signed-off-by: Nicholas Piggin 
>> ---
>>  arch/powerpc/kvm/book3s_hv_p9_entry.c | 34 ---
>>  1 file changed, 20 insertions(+), 14 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
>> b/arch/powerpc/kvm/book3s_hv_p9_entry.c
>> index 7aa72efcac6c..f305d1d6445c 100644
>> --- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
>> +++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
>> @@ -638,13 +638,16 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
>> time_limit, unsigned long lpc
>>
>>  host_hfscr = mfspr(SPRN_HFSCR);
>>  host_ciabr = mfspr(SPRN_CIABR);
>> -host_dawr0 = mfspr(SPRN_DAWR0);
>> -host_dawrx0 = mfspr(SPRN_DAWRX0);
>>  host_psscr = mfspr(SPRN_PSSCR);
>>  host_pidr = mfspr(SPRN_PID);
>> -if (cpu_has_feature(CPU_FTR_DAWR1)) {
>> -host_dawr1 = mfspr(SPRN_DAWR1);
>> -host_dawrx1 = mfspr(SPRN_DAWRX1);
>> +
>> +if (dawr_enabled()) {
>> +host_dawr0 = mfspr(SPRN_DAWR0);
>> +host_dawrx0 = mfspr(SPRN_DAWRX0);
>> +if (cpu_has_feature(CPU_FTR_DAWR1)) {
>> +host_dawr1 = mfspr(SPRN_DAWR1);
>> +host_dawrx1 = mfspr(SPRN_DAWRX1);
> 
> The userspace needs to enable DAWR1 via KVM_CAP_PPC_DAWR1. That cap is
> not even implemented in QEMU currently, so we never allow the guest to
> set vcpu->arch.dawr1. If we check for kvm->arch.dawr1_enabled instead of
> the CPU feature, we could shave some more time here.

Ah good point, yes let's do that.
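
Something like this, perhaps (just a sketch; it assumes the
kvm->arch.dawr1_enabled flag that KVM_CAP_PPC_DAWR1 sets, which this
series does not touch):

	if (dawr_enabled()) {
		host_dawr0 = mfspr(SPRN_DAWR0);
		host_dawrx0 = mfspr(SPRN_DAWRX0);
		/* gate on the per-VM capability, not the CPU feature */
		if (vcpu->kvm->arch.dawr1_enabled) {
			host_dawr1 = mfspr(SPRN_DAWR1);
			host_dawrx1 = mfspr(SPRN_DAWRX1);
		}
	}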

Thanks,
Nick


[PATCH 0/2] Update cpu_cpu_mask on CPU online/offline

2021-07-01 Thread Srikar Dronamraju
When simultaneously running CPU online/offline with CPU add/remove in a
loop, we see WARNING messages.

WARNING: CPU: 13 PID: 1142 at kernel/sched/topology.c:898 
build_sched_domains+0xd48/0x1720
Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag udp_diag
raw_diag inet_diag unix_diag af_packet_diag netlink_diag bonding tls
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink
pseries_rng xts vmx_crypto uio_pdrv_genirq uio binfmt_misc ip_tables xfs
libcrc32c dm_service_time sd_mod t10_pi sg ibmvfc scsi_transport_fc ibmveth
dm_multipath dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 13 PID: 1142 Comm: kworker/13:2 Not tainted 5.13.0-rc6+ #28
Workqueue: events cpuset_hotplug_workfn
NIP:  c01caac8 LR: c01caac4 CTR: 007088ec
REGS: c0005596f220 TRAP: 0700   Not tainted  (5.13.0-rc6+)
MSR:  80029033   CR: 48828222  XER: 0009
CFAR: c01ea698 IRQMASK: 0
GPR00: c01caac4 c0005596f4c0 c1c4a400 0036
GPR04: fffd c0005596f1d0 0027 c018cfd07f90
GPR08: 0023 0001 0027 c018fe68ffe8
GPR12: 8000 c0001e9d1880 c0013a047200 0800
GPR16: c1d3c7d0 0240 0048 c00010aacd18
GPR20: 0001 c00010aacc18 c0013a047c00 c00139ec2400
GPR24: 0280 c00139ec2520 c00136c1b400 c1c93060
GPR28: c0013a047c20 c1d3c6c0 c1c978a0 000d
NIP [c01caac8] build_sched_domains+0xd48/0x1720
LR [c01caac4] build_sched_domains+0xd44/0x1720
Call Trace:
[c0005596f4c0] [c01caac4] build_sched_domains+0xd44/0x1720 
(unreliable)
[c0005596f670] [c01cc5ec] partition_sched_domains_locked+0x3ac/0x4b0
[c0005596f710] [c02804e4] rebuild_sched_domains_locked+0x404/0x9e0
[c0005596f810] [c0283e60] rebuild_sched_domains+0x40/0x70
[c0005596f840] [c0284124] cpuset_hotplug_workfn+0x294/0xf10
[c0005596fc60] [c0175040] process_one_work+0x290/0x590
[c0005596fd00] [c01753c8] worker_thread+0x88/0x620
[c0005596fda0] [c0181704] kthread+0x194/0x1a0
[c0005596fe10] [c000ccec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
485af049 6000 2fa30800 409e0028 80fe e89a00f8 e86100e8 38da0120
7f88e378 7ce53b78 4801fb91 6000 <0fe0> 3900 38e0 38c0

This was because cpu_cpu_mask() was not getting updated on CPU
online/offline but only on CPU add/remove. Other cpumasks get updated
both on CPU online/offline and on add/remove, so update cpu_cpu_mask()
on CPU online/offline too.
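
For reference, cpu_cpu_mask() resolves through the NUMA node map, which
is why it goes stale unless the node cpumask is refreshed on CPU
online/offline as well (a sketch of the current include/linux/topology.h
definition):

	static inline const struct cpumask *cpu_cpu_mask(int cpu)
	{
		return cpumask_of_node(cpu_to_node(cpu));
	}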

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 

Srikar Dronamraju (2):
  powerpc/numa: Print debug statements only when required
  powerpc/numa: Update cpu_cpu_map on CPU online/offline

 arch/powerpc/include/asm/topology.h | 12 
 arch/powerpc/kernel/smp.c   |  3 +++
 arch/powerpc/mm/numa.c  | 18 +++---
 3 files changed, 22 insertions(+), 11 deletions(-)

-- 
2.27.0



[PATCH v2 00/32] powerpc: Add MSI IRQ domains to PCI drivers

2021-07-01 Thread Cédric Le Goater
Hello,

This series adds support for MSI IRQ domains on top of the XICS (P8)
and XIVE (P9/P10) IRQ domains for the PowerNV (baremetal) and pSeries
(VM) platforms. It should simplify and improve IRQ affinity of PCI
MSIs under these PowerPC platforms, especially for drivers distributing
multiple RX/TX queues on the different CPUs of the system.

Data locality can still be improved with an interrupt controller node
per chip but this requires FW changes. It could be done under OPAL.

The patchset has a large impact but it is well contained under the MSI
support. Initial tests were done on the P8, P9 and P10 PowerNV and
pSeries platforms, under the KVM and PowerVM hypervisor. PCI passthrough
was tested on P8/KVM, P9/KVM and P9/pVM with both interrupt modes.

P8 passthrough has some optimization to EOI MSIs when under real mode:

 e3c13e56a471 ("KVM: PPC: Book3S HV: Handle passthrough interrupts in guest")
 5d375199ea96 ("KVM: PPC: Book3S HV: Set server for passed-through interrupts")

They give us a ~10% bandwidth improvement on some 100G adapters
(Thanks Alexey), so it's good to keep but they require access to the
low level IRQ domain of the machine. It should be possible to rework
the code and use the MSI IRQ domains instead but for now, it's simpler
to keep the bypass. That can come later.

The P8/CAPI driver is also impacted. Tests were done on a Firestone
system with a memory AFU.

Thanks,

C.

Changes since v2:

 - Included some CONFIG_IRQ_DOMAIN_HIERARCHY ifdefs
 - Microwatt fixes for ICS native
 - Removed irqd_is_started() check when setting the affinity

Cédric Le Goater (32):
  powerpc/pseries/pci: Introduce __find_pe_total_msi()
  powerpc/pseries/pci: Introduce rtas_prepare_msi_irqs()
  powerpc/xive: Add support for IRQ domain hierarchy
  powerpc/xive: Ease debugging of xive_irq_set_affinity()
  powerpc/pseries/pci: Add MSI domains
  powerpc/xive: Drop unmask of MSIs at startup
  powerpc/xive: Remove irqd_is_started() check when setting the affinity
  powerpc/pseries/pci: Add a domain_free_irqs() handler
  powerpc/pseries/pci: Add a msi_free() handler to clear XIVE data
  powerpc/pseries/pci: Add support of MSI domains to PHB hotplug
  powerpc/powernv/pci: Introduce __pnv_pci_ioda_msi_setup()
  powerpc/powernv/pci: Add MSI domains
  KVM: PPC: Book3S HV: Use the new IRQ chip to detect passthrough
interrupts
  KVM: PPC: Book3S HV: XIVE: Change interface of passthrough interrupt
routines
  KVM: PPC: Book3S HV: XIVE: Fix mapping of passthrough interrupts
  powerpc/xics: Remove ICS list
  powerpc/xics: Rename the map handler in a check handler
  powerpc/xics: Give a name to the default XICS IRQ domain
  powerpc/xics: Add debug logging to the set_irq_affinity handlers
  powerpc/xics: Add support for IRQ domain hierarchy
  powerpc/powernv/pci: Customize the MSI EOI handler to support PHB3
  powerpc/pci: Drop XIVE restriction on MSI domains
  powerpc/xics: Drop unmask of MSIs at startup
  powerpc/pseries/pci: Drop unused MSI code
  powerpc/powernv/pci: Drop unused MSI code
  powerpc/powernv/pci: Adapt is_pnv_opal_msi() to detect passthrough
interrupt
  powerpc/xics: Fix IRQ migration
  powerpc/powernv/pci: Set the IRQ chip data for P8/CXL devices
  powerpc/powernv/pci: Rework pnv_opal_pci_msi_eoi()
  KVM: PPC: Book3S HV: XICS: Fix mapping of passthrough interrupts
  powerpc/xive: Use XIVE domain under xmon and debugfs
  genirq: Improve "hwirq" output in /proc and /sys/

 arch/powerpc/include/asm/kvm_ppc.h |   4 +-
 arch/powerpc/include/asm/pci-bridge.h  |   5 +
 arch/powerpc/include/asm/pnv-pci.h |   2 +-
 arch/powerpc/include/asm/xics.h|   3 +-
 arch/powerpc/include/asm/xive.h|   1 +
 arch/powerpc/platforms/powernv/pci.h   |   6 -
 arch/powerpc/platforms/pseries/pseries.h   |   2 +
 arch/powerpc/kernel/pci-common.c   |   6 +
 arch/powerpc/kvm/book3s_hv.c   |  18 +-
 arch/powerpc/kvm/book3s_hv_rm_xics.c   |   8 +-
 arch/powerpc/kvm/book3s_xive.c |  18 +-
 arch/powerpc/platforms/powernv/pci-ioda.c  | 256 --
 arch/powerpc/platforms/powernv/pci.c   |  67 -
 arch/powerpc/platforms/pseries/msi.c   | 296 -
 arch/powerpc/platforms/pseries/pci_dlpar.c |   4 +
 arch/powerpc/platforms/pseries/setup.c |   2 +
 arch/powerpc/sysdev/xics/ics-native.c  |  13 +-
 arch/powerpc/sysdev/xics/ics-opal.c|  40 +--
 arch/powerpc/sysdev/xics/ics-rtas.c|  40 +--
 arch/powerpc/sysdev/xics/xics-common.c | 129 ++---
 arch/powerpc/sysdev/xive/common.c  |  98 +--
 kernel/irq/irqdesc.c   |   2 +-
 kernel/irq/irqdomain.c |   1 +
 kernel/irq/proc.c  |   2 +-
 24 files changed, 710 insertions(+), 313 deletions(-)

-- 
2.31.1



[PATCH v2 05/32] powerpc/pseries/pci: Add MSI domains

2021-07-01 Thread Cédric Le Goater
Two IRQ domains are added on top of default machine IRQ domain.

First, the top level "pSeries-PCI-MSI" domain deals with the MSI
specificities. In this domain, the HW IRQ numbers are generated by the
PCI MSI layer, they compose a unique ID for an MSI source with the PCI
device identifier and the MSI vector number.

These numbers can be quite large on a pSeries machine running under
the IBM Hypervisor and /sys/kernel/irq/ and /proc/interrupts will
require small fixes to show them correctly.
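
For reference, the generic PCI/MSI layer composes such HW IRQ numbers
roughly as follows (a sketch modelled on pci_msi_domain_calc_hwirq() in
the generic MSI code; the field widths are the generic ones, not
something this patch defines):

	static irq_hw_number_t pci_msi_calc_hwirq(struct pci_dev *dev,
						  struct msi_desc *desc)
	{
		return (irq_hw_number_t)desc->msi_attrib.entry_nr |
			pci_dev_id(dev) << 11 |
			(pci_domain_nr(dev->bus) & 0xFFFFFFFF) << 27;
	}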

Second domain is the in-the-middle "pSeries-MSI" domain which acts as
a proxy between the PCI MSI subsystem and the machine IRQ subsystem.
It usually allocates the MSI vector numbers but, on pSeries machines,
this is done by the RTAS FW and RTAS returns IRQ numbers in the IRQ
number space of the machine. This is why the in-the-middle "pSeries-MSI"
domain has the same HW IRQ numbers as its parent domain.

Only the XIVE (P9/P10) parent domain is supported for now. We still
need to add support for IRQ domain hierarchy under XICS.
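
The resulting stack looks roughly like this (top to bottom):

	"pSeries-PCI-MSI"  PCI/MSI specifics, hwirq composed from
	       |           device id + vector number
	"pSeries-MSI"      proxy, hwirq identical to its parent's
	       |           (vectors allocated by the RTAS FW)
	XIVE               machine IRQ domain (P9/P10 only, for now)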

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/include/asm/pci-bridge.h|   5 +
 arch/powerpc/platforms/pseries/pseries.h |   1 +
 arch/powerpc/kernel/pci-common.c |   6 +
 arch/powerpc/platforms/pseries/msi.c | 185 +++
 arch/powerpc/platforms/pseries/setup.c   |   2 +
 5 files changed, 199 insertions(+)

diff --git a/arch/powerpc/include/asm/pci-bridge.h 
b/arch/powerpc/include/asm/pci-bridge.h
index 74424c14515c..90f488fa4c17 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -126,6 +126,11 @@ struct pci_controller {
 #endif /* CONFIG_PPC64 */
 
void *private_data;
+
+   /* IRQ domain hierarchy */
+   struct irq_domain   *dev_domain;
+   struct irq_domain   *msi_domain;
+   struct fwnode_handle*fwnode;
 };
 
 /* These are used for config access before all the PCI probing
diff --git a/arch/powerpc/platforms/pseries/pseries.h 
b/arch/powerpc/platforms/pseries/pseries.h
index 1f051a786fb3..d9280262588b 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -85,6 +85,7 @@ struct pci_host_bridge;
 int pseries_root_bridge_prepare(struct pci_host_bridge *bridge);
 
 extern struct pci_controller_ops pseries_pci_controller_ops;
+int pseries_msi_allocate_domains(struct pci_controller *phb);
 
 unsigned long pseries_memory_block_size(void);
 
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 001e90cd8948..c3573430919d 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1060,11 +1061,16 @@ void pcibios_bus_add_device(struct pci_dev *dev)
 
 int pcibios_add_device(struct pci_dev *dev)
 {
+   struct irq_domain *d;
+
 #ifdef CONFIG_PCI_IOV
if (ppc_md.pcibios_fixup_sriov)
ppc_md.pcibios_fixup_sriov(dev);
 #endif /* CONFIG_PCI_IOV */
 
+   d = dev_get_msi_domain(&dev->bus->dev);
+   if (d)
+   dev_set_msi_domain(&dev->dev, d);
return 0;
 }
 
diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index 4bf14f27e1aa..86c6809ebac2 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "pseries.h"
 
@@ -518,6 +519,190 @@ static int rtas_setup_msi_irqs(struct pci_dev *pdev, int 
nvec_in, int type)
return 0;
 }
 
+static int pseries_msi_ops_prepare(struct irq_domain *domain, struct device 
*dev,
+  int nvec, msi_alloc_info_t *arg)
+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+   struct msi_desc *desc = first_pci_msi_entry(pdev);
+   int type = desc->msi_attrib.is_msix ? PCI_CAP_ID_MSIX : PCI_CAP_ID_MSI;
+
+   return rtas_prepare_msi_irqs(pdev, nvec, type, arg);
+}
+
+static struct msi_domain_ops pseries_pci_msi_domain_ops = {
+   .msi_prepare= pseries_msi_ops_prepare,
+};
+
+static void pseries_msi_shutdown(struct irq_data *d)
+{
+   d = d->parent_data;
+   if (d->chip->irq_shutdown)
+   d->chip->irq_shutdown(d);
+}
+
+static void pseries_msi_mask(struct irq_data *d)
+{
+   pci_msi_mask_irq(d);
+   irq_chip_mask_parent(d);
+}
+
+static void pseries_msi_unmask(struct irq_data *d)
+{
+   pci_msi_unmask_irq(d);
+   irq_chip_unmask_parent(d);
+}
+
+static struct irq_chip pseries_pci_msi_irq_chip = {
+   .name   = "pSeries-PCI-MSI",
+   .irq_shutdown   = pseries_msi_shutdown,
+   .irq_mask   = pseries_msi_mask,
+   .irq_unmask = pseries_msi_unmask,
+   .irq_eoi= irq_chip_eoi_parent,
+};
+
+static struct msi_domain_info pseries_msi_domain_info = {
+   .flags = (MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
+ MSI_FLAG_MULTI_PCI_MSI  | 

[PATCH v2 04/32] powerpc/xive: Ease debugging of xive_irq_set_affinity()

2021-07-01 Thread Cédric Le Goater
pr_debug() is easier to activate and it helps to know how the kernel
configures the HW when tweaking the IRQ subsystem.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 834f1a378fc2..2c907a4a2b05 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -723,7 +723,7 @@ static int xive_irq_set_affinity(struct irq_data *d,
u32 target, old_target;
int rc = 0;
 
-   pr_devel("xive_irq_set_affinity: irq %d\n", d->irq);
+   pr_debug("%s: irq %d/%x\n", __func__, d->irq, hw_irq);
 
/* Is this valid ? */
if (cpumask_any_and(cpumask, cpu_online_mask) >= nr_cpu_ids)
@@ -768,7 +768,7 @@ static int xive_irq_set_affinity(struct irq_data *d,
return rc;
}
 
-   pr_devel("  target: 0x%x\n", target);
+   pr_debug("  target: 0x%x\n", target);
xd->target = target;
 
/* Give up previous target */
-- 
2.31.1



[PATCH v2 09/32] powerpc/pseries/pci: Add a msi_free() handler to clear XIVE data

2021-07-01 Thread Cédric Le Goater
The MSI domain clears the IRQ with msi_domain_free(), which calls
irq_domain_free_irqs_top(), which clears the handler data. This is a
problem for the XIVE controller since we need to unmap MMIO pages and
free a specific XIVE structure.

The 'msi_free()' handler is called before irq_domain_free_irqs_top()
when the handler data is still available. Use that to clear the XIVE
controller data.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/include/asm/xive.h  |  1 +
 arch/powerpc/platforms/pseries/msi.c | 16 +++-
 arch/powerpc/sysdev/xive/common.c|  5 -
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index aa094a8655b0..20ae50ab083c 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -111,6 +111,7 @@ void xive_native_free_vp_block(u32 vp_base);
 int xive_native_populate_irq_data(u32 hw_irq,
  struct xive_irq_data *data);
 void xive_cleanup_irq_data(struct xive_irq_data *xd);
+void xive_irq_free_data(unsigned int virq);
 void xive_native_free_irq(u32 irq);
 int xive_native_configure_irq(u32 hw_irq, u32 target, u8 prio, u32 sw_irq);
 
diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index 591cee9cbc9e..f9635b01b2bf 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -529,6 +529,19 @@ static int pseries_msi_ops_prepare(struct irq_domain 
*domain, struct device *dev
return rtas_prepare_msi_irqs(pdev, nvec, type, arg);
 }
 
+/*
+ * ->msi_free() is called before irq_domain_free_irqs_top() when the
+ * handler data is still available. Use that to clear the XIVE
+ * controller data.
+ */
+static void pseries_msi_ops_msi_free(struct irq_domain *domain,
+struct msi_domain_info *info,
+unsigned int irq)
+{
+   if (xive_enabled())
+   xive_irq_free_data(irq);
+}
+
 /*
  * RTAS can not disable one MSI at a time. It's all or nothing. Do it
  * at the end after all IRQs have been freed.
@@ -546,6 +559,7 @@ static void pseries_msi_domain_free_irqs(struct irq_domain 
*domain,
 
 static struct msi_domain_ops pseries_pci_msi_domain_ops = {
.msi_prepare= pseries_msi_ops_prepare,
+   .msi_free   = pseries_msi_ops_msi_free,
.domain_free_irqs = pseries_msi_domain_free_irqs,
 };
 
@@ -660,7 +674,7 @@ static void pseries_irq_domain_free(struct irq_domain 
*domain, unsigned int virq
 
pr_debug("%s bridge %pOF %d #%d\n", __func__, phb->dn, virq, nr_irqs);
 
-   irq_domain_free_irqs_parent(domain, virq, nr_irqs);
+   /* XIVE domain data is cleared through ->msi_free() */
 }
 
 static const struct irq_domain_ops pseries_irq_domain_ops = {
diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 38183c9b21c0..f0012d6b4fe9 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -986,6 +986,8 @@ EXPORT_SYMBOL_GPL(is_xive_irq);
 
 void xive_cleanup_irq_data(struct xive_irq_data *xd)
 {
+   pr_debug("%s for HW %x\n", __func__, xd->hw_irq);
+
if (xd->eoi_mmio) {
iounmap(xd->eoi_mmio);
if (xd->eoi_mmio == xd->trig_mmio)
@@ -1027,7 +1029,7 @@ static int xive_irq_alloc_data(unsigned int virq, 
irq_hw_number_t hw)
return 0;
 }
 
-static void xive_irq_free_data(unsigned int virq)
+void xive_irq_free_data(unsigned int virq)
 {
struct xive_irq_data *xd = irq_get_handler_data(virq);
 
@@ -1037,6 +1039,7 @@ static void xive_irq_free_data(unsigned int virq)
xive_cleanup_irq_data(xd);
kfree(xd);
 }
+EXPORT_SYMBOL_GPL(xive_irq_free_data);
 
 #ifdef CONFIG_SMP
 
-- 
2.31.1



[PATCH v2 29/32] powerpc/powernv/pci: Rework pnv_opal_pci_msi_eoi()

2021-07-01 Thread Cédric Le Goater
pnv_opal_pci_msi_eoi() is called from KVM to EOI passthrough interrupts
when in real mode. Adding the MSI domains broke the hack using the
'ioda.irq_chip' field to deduce the owning PHB. Fix that by using the
IRQ chip data in the MSI domain.

The 'ioda.irq_chip' field is now unused and could be removed from the
pnv_phb struct.

Cc: Alexey Kardashevskiy 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/include/asm/pnv-pci.h|  2 +-
 arch/powerpc/kvm/book3s_hv_rm_xics.c  |  8 
 arch/powerpc/platforms/powernv/pci-ioda.c | 17 +
 3 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/pnv-pci.h 
b/arch/powerpc/include/asm/pnv-pci.h
index d0ee0ede5767..b3f480799352 100644
--- a/arch/powerpc/include/asm/pnv-pci.h
+++ b/arch/powerpc/include/asm/pnv-pci.h
@@ -33,7 +33,7 @@ int pnv_cxl_alloc_hwirqs(struct pci_dev *dev, int num);
 void pnv_cxl_release_hwirqs(struct pci_dev *dev, int hwirq, int num);
 int pnv_cxl_get_irq_count(struct pci_dev *dev);
 struct device_node *pnv_pci_get_phb_node(struct pci_dev *dev);
-int64_t pnv_opal_pci_msi_eoi(struct irq_chip *chip, unsigned int hw_irq);
+int64_t pnv_opal_pci_msi_eoi(struct irq_data *d);
 bool is_pnv_opal_msi(struct irq_chip *chip);
 
 #ifdef CONFIG_CXL_BASE
diff --git a/arch/powerpc/kvm/book3s_hv_rm_xics.c 
b/arch/powerpc/kvm/book3s_hv_rm_xics.c
index 0a11ec88a0ae..587c33fc4564 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_xics.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_xics.c
@@ -706,6 +706,7 @@ static int ics_rm_eoi(struct kvm_vcpu *vcpu, u32 irq)
icp->rm_eoied_irq = irq;
}
 
+   /* Handle passthrough interrupts */
if (state->host_irq) {
++vcpu->stat.pthru_all;
if (state->intr_cpu != -1) {
@@ -759,12 +760,12 @@ int xics_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long 
xirr)
 
 static unsigned long eoi_rc;
 
-static void icp_eoi(struct irq_chip *c, u32 hwirq, __be32 xirr, bool *again)
+static void icp_eoi(struct irq_data *d, u32 hwirq, __be32 xirr, bool *again)
 {
void __iomem *xics_phys;
int64_t rc;
 
-   rc = pnv_opal_pci_msi_eoi(c, hwirq);
+   rc = pnv_opal_pci_msi_eoi(d);
 
if (rc)
eoi_rc = rc;
@@ -872,8 +873,7 @@ long kvmppc_deliver_irq_passthru(struct kvm_vcpu *vcpu,
icp_rm_deliver_irq(xics, icp, irq, false);
 
/* EOI the interrupt */
-   icp_eoi(irq_desc_get_chip(irq_map->desc), irq_map->r_hwirq, xirr,
-   again);
+   icp_eoi(irq_desc_get_irq_data(irq_map->desc), irq_map->r_hwirq, xirr, 
again);
 
if (check_too_hard(xics, icp) == H_TOO_HARD)
return 2;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index aa97245eedbf..2389cd79c3c8 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1963,12 +1963,21 @@ void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
pe->dma_setup_done = true;
 }
 
-int64_t pnv_opal_pci_msi_eoi(struct irq_chip *chip, unsigned int hw_irq)
+/*
+ * Called from KVM in real mode to EOI passthru interrupts. The ICP
+ * EOI is handled directly in KVM in kvmppc_deliver_irq_passthru().
+ *
+ * The IRQ data is mapped in the PCI-MSI domain and the EOI OPAL call
+ * needs an HW IRQ number mapped in the XICS IRQ domain. The HW IRQ
+ * numbers of the in-the-middle MSI domain are vector numbers and it's
+ * good enough for OPAL. Use that.
+ */
+int64_t pnv_opal_pci_msi_eoi(struct irq_data *d)
 {
-   struct pnv_phb *phb = container_of(chip, struct pnv_phb,
-  ioda.irq_chip);
+   struct pci_controller *hose = 
irq_data_get_irq_chip_data(d->parent_data);
+   struct pnv_phb *phb = hose->private_data;
 
-   return opal_pci_msi_eoi(phb->opal_id, hw_irq);
+   return opal_pci_msi_eoi(phb->opal_id, d->parent_data->hwirq);
 }
 
 /*
-- 
2.31.1



[PATCH v2 27/32] powerpc/xics: Fix IRQ migration

2021-07-01 Thread Cédric Le Goater
desc->irq_data points to the top level IRQ data descriptor which is
not necessarily in the XICS IRQ domain. MSIs are in another domain for
instance. Fix that by looking for a mapping on the low level XICS IRQ
domain.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xics/xics-common.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/sysdev/xics/xics-common.c 
b/arch/powerpc/sysdev/xics/xics-common.c
index e82d0d4ddec0..0b8b49446992 100644
--- a/arch/powerpc/sysdev/xics/xics-common.c
+++ b/arch/powerpc/sysdev/xics/xics-common.c
@@ -183,6 +183,8 @@ void xics_migrate_irqs_away(void)
unsigned int irq, virq;
struct irq_desc *desc;
 
+   pr_debug("%s: CPU %u\n", __func__, cpu);
+
/* If we used to be the default server, move to the new "boot_cpuid" */
if (hw_cpu == xics_default_server)
xics_update_irq_servers();
@@ -197,6 +199,7 @@ void xics_migrate_irqs_away(void)
struct irq_chip *chip;
long server;
unsigned long flags;
+   struct irq_data *irqd;
 
/* We can't set affinity on ISA interrupts */
if (virq < NUM_ISA_INTERRUPTS)
@@ -204,9 +207,11 @@ void xics_migrate_irqs_away(void)
/* We only need to migrate enabled IRQS */
if (!desc->action)
continue;
-   if (desc->irq_data.domain != xics_host)
+   /* We need a mapping in the XICS IRQ domain */
+   irqd = irq_domain_get_irq_data(xics_host, virq);
+   if (!irqd)
continue;
-   irq = desc->irq_data.hwirq;
+   irq = irqd_to_hwirq(irqd);
/* We need to get IPIs still. */
if (irq == XICS_IPI || irq == XICS_IRQ_SPURIOUS)
continue;
-- 
2.31.1



[PATCH v2 25/32] powerpc/powernv/pci: Drop unused MSI code

2021-07-01 Thread Cédric Le Goater
MSIs should be fully managed by the PCI and IRQ subsystems now.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/powernv/pci.h  |  6 --
 arch/powerpc/platforms/powernv/pci-ioda.c | 27 -
 arch/powerpc/platforms/powernv/pci.c  | 67 ---
 3 files changed, 100 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci.h 
b/arch/powerpc/platforms/powernv/pci.h
index c8d4f222a86f..966a9eb64339 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -123,11 +123,7 @@ struct pnv_phb {
 #endif
 
unsigned intmsi_base;
-   unsigned intmsi32_support;
struct msi_bitmap   msi_bmp;
-   int (*msi_setup)(struct pnv_phb *phb, struct pci_dev *dev,
-unsigned int hwirq, unsigned int virq,
-unsigned int is_64, struct msi_msg *msg);
int (*init_m64)(struct pnv_phb *phb);
int (*get_pe_state)(struct pnv_phb *phb, int pe_no);
void (*freeze_pe)(struct pnv_phb *phb, int pe_no);
@@ -289,8 +285,6 @@ extern void pnv_pci_init_npu2_opencapi_phb(struct 
device_node *np);
 extern void pnv_pci_reset_secondary_bus(struct pci_dev *dev);
 extern int pnv_eeh_phb_reset(struct pci_controller *hose, int option);
 
-extern int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type);
-extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
 extern struct pnv_ioda_pe *pnv_pci_bdfn_to_pe(struct pnv_phb *phb, u16 bdfn);
 extern struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev);
 extern void pnv_set_msi_irq_chip(struct pnv_phb *phb, unsigned int virq);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index e2454439e574..eb38ce1fd434 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2080,29 +2080,6 @@ static int __pnv_pci_ioda_msi_setup(struct pnv_phb *phb, 
struct pci_dev *dev,
return 0;
 }
 
-static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
- unsigned int hwirq, unsigned int virq,
- unsigned int is_64, struct msi_msg *msg)
-{
-   struct pnv_ioda_pe *pe = pnv_ioda_get_pe(dev);
-   unsigned int xive_num = hwirq - phb->msi_base;
-   int rc;
-
-   rc = __pnv_pci_ioda_msi_setup(phb, dev, xive_num, is_64, msg);
-   if (rc)
-   return rc;
-
-   /* P8 only */
-   pnv_set_msi_irq_chip(phb, virq);
-
-   pr_devel("%s: %s-bit MSI on hwirq %x (xive #%d),"
-" address=%x_%08x data=%x PE# %x\n",
-pci_name(dev), is_64 ? "64" : "32", hwirq, xive_num,
-msg->address_hi, msg->address_lo, msg->data, pe->pe_number);
-
-   return 0;
-}
-
 /*
  * The msi_free() op is called before irq_domain_free_irqs_top() when
  * the handler data is still available. Use that to clear the XIVE
@@ -2327,8 +2304,6 @@ static void pnv_pci_init_ioda_msis(struct pnv_phb *phb)
return;
}
 
-   phb->msi_setup = pnv_pci_ioda_msi_setup;
-   phb->msi32_support = 1;
pr_info("  Allocated bitmap for %d MSIs (base IRQ 0x%x)\n",
count, phb->msi_base);
 
@@ -2936,8 +2911,6 @@ static const struct pci_controller_ops 
pnv_pci_ioda_controller_ops = {
.dma_dev_setup  = pnv_pci_ioda_dma_dev_setup,
.dma_bus_setup  = pnv_pci_ioda_dma_bus_setup,
.iommu_bypass_supported = pnv_pci_ioda_iommu_bypass_supported,
-   .setup_msi_irqs = pnv_setup_msi_irqs,
-   .teardown_msi_irqs  = pnv_teardown_msi_irqs,
.enable_device_hook = pnv_pci_enable_device_hook,
.release_device = pnv_pci_release_device,
.window_alignment   = pnv_pci_window_alignment,
diff --git a/arch/powerpc/platforms/powernv/pci.c 
b/arch/powerpc/platforms/powernv/pci.c
index b18468dc31ff..e9dee50ea881 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -160,73 +160,6 @@ int pnv_pci_set_power_state(uint64_t id, uint8_t state, 
struct opal_msg *msg)
 }
 EXPORT_SYMBOL_GPL(pnv_pci_set_power_state);
 
-int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
-{
-   struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
-   struct msi_desc *entry;
-   struct msi_msg msg;
-   int hwirq;
-   unsigned int virq;
-   int rc;
-
-   if (WARN_ON(!phb) || !phb->msi_bmp.bitmap)
-   return -ENODEV;
-
-   if (pdev->no_64bit_msi && !phb->msi32_support)
-   return -ENODEV;
-
-   for_each_pci_msi_entry(entry, pdev) {
-   if (!entry->msi_attrib.is_64 && !phb->msi32_support) {
-   pr_warn("%s: Supports only 64-bit MSIs\n",
-   pci_name(pdev));
-   return -ENXIO;
-   }
-   hwirq = 

Re: [PATCH] sched: Use WARN_ON

2021-07-01 Thread Arnd Bergmann
On Thu, Jul 1, 2021 at 2:57 PM Christophe Leroy
 wrote:
> Le 01/07/2021 à 14:50, Jason Wang a écrit :
> > The BUG_ON macro simplifies an if condition followed by BUG(), but it
> > will lead to the kernel crashing. Therefore, we can try using WARN_ON
> > instead of an if condition followed by BUG().
>
> But are you sure it is ok to continue if spu_acquire(ctx) returned false?
> Shouldn't there be at least some fallback handling?
>
> Something like:
>
> if (WARN_ON(spu_acquire(ctx)))
> return;

I think you get a crash in either case:

- with the existing BUG_ON() there is an immediate backtrace and it stops there
- with WARN_ON() and continuing, you operate on a context that is not
  valid
- with the 'return', you get an endless loop, as it keeps calling
  spusched_tick() without sleeping.

Out of those options, the existing BUG_ON() seems best.
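
The three behaviours, side by side (sketch):

	if (spu_acquire(ctx))
		BUG();			/* 1: immediate backtrace, stops there */

	WARN_ON(spu_acquire(ctx));	/* 2: warn, then operate on an
					   unacquired context */

	if (WARN_ON(spu_acquire(ctx)))
		return;			/* 3: warn and bail; the caller keeps
					   calling spusched_tick() without
					   sleeping */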

   Arnd


Re: [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload

2021-07-01 Thread Nicholas Piggin
Excerpts from Nicholas Piggin's message of July 1, 2021 5:37 pm:
> Excerpts from Christopher M. Riedl's message of July 1, 2021 4:53 pm:
>> On Thu Jul 1, 2021 at 1:04 AM CDT, Nicholas Piggin wrote:
>>> Excerpts from Christopher M. Riedl's message of July 1, 2021 3:28 pm:
>>> > On Wed Jun 30, 2021 at 11:15 PM CDT, Nicholas Piggin wrote:
>>> >> Excerpts from Christopher M. Riedl's message of July 1, 2021 1:48 pm:
>>> >> > On Sun Jun 20, 2021 at 10:13 PM CDT, Daniel Axtens wrote:
>>> >> >> "Christopher M. Riedl"  writes:
>>> >> >>
>>> >> >> > Switching to a different mm with Hash translation causes SLB 
>>> >> >> > entries to
>>> >> >> > be preloaded from the current thread_info. This reduces SLB faults, 
>>> >> >> > for
>>> >> >> > example when threads share a common mm but operate on different 
>>> >> >> > address
>>> >> >> > ranges.
>>> >> >> >
>>> >> >> > Preloading entries from the thread_info struct may not always be
>>> >> >> > appropriate - such as when switching to a temporary mm. Introduce a 
>>> >> >> > new
>>> >> >> > boolean in mm_context_t to skip the SLB preload entirely. Also move 
>>> >> >> > the
>>> >> >> > SLB preload code into a separate function since switch_slb() is 
>>> >> >> > already
>>> >> >> > quite long. The default behavior (preloading SLB entries from the
>>> >> >> > current thread_info struct) remains unchanged.
>>> >> >> >
>>> >> >> > Signed-off-by: Christopher M. Riedl 
>>> >> >> >
>>> >> >> > ---
>>> >> >> >
>>> >> >> > v4:  * New to series.
>>> >> >> > ---
>>> >> >> >  arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
>>> >> >> >  arch/powerpc/include/asm/mmu_context.h   | 13 ++
>>> >> >> >  arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
>>> >> >> >  arch/powerpc/mm/book3s64/slb.c   | 56 
>>> >> >> > ++--
>>> >> >> >  4 files changed, 50 insertions(+), 24 deletions(-)
>>> >> >> >
>>> >> >> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
>>> >> >> > b/arch/powerpc/include/asm/book3s/64/mmu.h
>>> >> >> > index eace8c3f7b0a1..b23a9dcdee5af 100644
>>> >> >> > --- a/arch/powerpc/include/asm/book3s/64/mmu.h
>>> >> >> > +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
>>> >> >> > @@ -130,6 +130,9 @@ typedef struct {
>>> >> >> > u32 pkey_allocation_map;
>>> >> >> > s16 execute_only_pkey; /* key holding execute-only protection */
>>> >> >> >  #endif
>>> >> >> > +
>>> >> >> > +   /* Do not preload SLB entries from thread_info during 
>>> >> >> > switch_slb() */
>>> >> >> > +   bool skip_slb_preload;
>>> >> >> >  } mm_context_t;
>>> >> >> >  
>>> >> >> >  static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
>>> >> >> > diff --git a/arch/powerpc/include/asm/mmu_context.h 
>>> >> >> > b/arch/powerpc/include/asm/mmu_context.h
>>> >> >> > index 4bc45d3ed8b0e..264787e90b1a1 100644
>>> >> >> > --- a/arch/powerpc/include/asm/mmu_context.h
>>> >> >> > +++ b/arch/powerpc/include/asm/mmu_context.h
>>> >> >> > @@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct 
>>> >> >> > mm_struct *oldmm,
>>> >> >> > return 0;
>>> >> >> >  }
>>> >> >> >  
>>> >> >> > +#ifdef CONFIG_PPC_BOOK3S_64
>>> >> >> > +
>>> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm)
>>> >> >> > +{
>>> >> >> > +   mm->context.skip_slb_preload = true;
>>> >> >> > +}
>>> >> >> > +
>>> >> >> > +#else
>>> >> >> > +
>>> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
>>> >> >> > +
>>> >> >> > +#endif /* CONFIG_PPC_BOOK3S_64 */
>>> >> >> > +
>>> >> >> >  #include 
>>> >> >> >  
>>> >> >> >  #endif /* __KERNEL__ */
>>> >> >> > diff --git a/arch/powerpc/mm/book3s64/mmu_context.c 
>>> >> >> > b/arch/powerpc/mm/book3s64/mmu_context.c
>>> >> >> > index c10fc8a72fb37..3479910264c59 100644
>>> >> >> > --- a/arch/powerpc/mm/book3s64/mmu_context.c
>>> >> >> > +++ b/arch/powerpc/mm/book3s64/mmu_context.c
>>> >> >> > @@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, 
>>> >> >> > struct mm_struct *mm)
>>> >> >> > atomic_set(&mm->context.active_cpus, 0);
>>> >> >> > atomic_set(&mm->context.copros, 0);
>>> >> >> >  
>>> >> >> > +   mm->context.skip_slb_preload = false;
>>> >> >> > +
>>> >> >> > return 0;
>>> >> >> >  }
>>> >> >> >  
>>> >> >> > diff --git a/arch/powerpc/mm/book3s64/slb.c 
>>> >> >> > b/arch/powerpc/mm/book3s64/slb.c
>>> >> >> > index c91bd85eb90e3..da0836cb855af 100644
>>> >> >> > --- a/arch/powerpc/mm/book3s64/slb.c
>>> >> >> > +++ b/arch/powerpc/mm/book3s64/slb.c
>>> >> >> > @@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int 
>>> >> >> > index)
>>> >> >> > asm volatile("slbie %0" : : "r" (slbie_data));
>>> >> >> >  }
>>> >> >> >  
>>> >> >> > +static void preload_slb_entries(struct task_struct *tsk, struct 
>>> >> >> > mm_struct *mm)
>>> >> >> Should this be explicitly inline or even __always_inline? I'm thinking
>>> >> >> switch_slb is probably a fairly hot path on hash?
>>> >> > 
>>> >> > Yes absolutely. I'll make this change in v5.
>>> >> > 
>>> >> >>
>>> >> >> > +{
>>> >> >> > +   

[PATCH] powerpc: Only build restart_table.c for 64s

2021-07-01 Thread Michael Ellerman
Commit 9b69d48c7516 ("powerpc/64e: remove implicit soft-masking and
interrupt exit restart logic") limited the implicit soft masking and
restart logic to 64-bit Book3S only. However we are still building
restart_table.c for all 64-bit, ie. Book3E also.

There's no need to build it for 64e, and it also causes missing
prototype warnings for 64e builds, because the prototype is already
behind an #ifdef PPC_BOOK3S_64.

Fixes: 9b69d48c7516 ("powerpc/64e: remove implicit soft-masking and interrupt 
exit restart logic")
Reported-by: kernel test robot 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/lib/Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index 4c92c80454f3..99a7c9132422 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -39,10 +39,10 @@ extra-$(CONFIG_PPC64)   += crtsavres.o
 endif
 
 obj-$(CONFIG_PPC_BOOK3S_64) += copyuser_power7.o copypage_power7.o \
-  memcpy_power7.o
+  memcpy_power7.o restart_table.o
 
 obj64-y+= copypage_64.o copyuser_64.o mem_64.o hweight_64.o \
-  memcpy_64.o copy_mc_64.o restart_table.o
+  memcpy_64.o copy_mc_64.o
 
 ifndef CONFIG_PPC_QUEUED_SPINLOCKS
 obj64-$(CONFIG_SMP)+= locks.o
-- 
2.25.1



Re: [PATCH] sched: Use WARN_ON

2021-07-01 Thread Christophe Leroy




Le 01/07/2021 à 14:50, Jason Wang a écrit :

The BUG_ON macro simplifies an if condition followed by BUG(), but it
will lead to the kernel crashing. Therefore, we can try using WARN_ON
instead of an if condition followed by BUG().


But are you sure it is ok to continue if spu_acquire(ctx) returned false?
Shouldn't there be at least some fallback handling?

Something like:

if (WARN_ON(spu_acquire(ctx)))
return;


Christophe




Signed-off-by: Jason Wang 
---
  arch/powerpc/platforms/cell/spufs/sched.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c 
b/arch/powerpc/platforms/cell/spufs/sched.c
index 369206489895..0f218d9e5733 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -904,8 +904,8 @@ static noinline void spusched_tick(struct spu_context *ctx)
struct spu_context *new = NULL;
struct spu *spu = NULL;
  
-	if (spu_acquire(ctx))

-   BUG();  /* a kernel thread never has signals pending */
+   /* a kernel thread never has signals pending */
+   WARN_ON(spu_acquire(ctx));
  
  	if (ctx->state != SPU_STATE_RUNNABLE)

goto out;



Re: [RFC PATCH 10/43] powerpc/64s: Always set PMU control registers to frozen/disabled when not in use

2021-07-01 Thread Madhavan Srinivasan



On 6/22/21 4:27 PM, Nicholas Piggin wrote:

KVM PMU management code looks for particular frozen/disabled bits in
the PMU registers so it knows whether it must clear them when coming
out of a guest or not. Setting this up helps KVM make these optimisations
without getting confused. Longer term the better approach might be to
move guest/host PMU switching to the perf subsystem.

Signed-off-by: Nicholas Piggin 
---
  arch/powerpc/kernel/cpu_setup_power.c | 4 ++--
  arch/powerpc/kernel/dt_cpu_ftrs.c | 6 +++---
  arch/powerpc/kvm/book3s_hv.c  | 5 +
  arch/powerpc/perf/core-book3s.c   | 7 +++
  4 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/cpu_setup_power.c 
b/arch/powerpc/kernel/cpu_setup_power.c
index a29dc8326622..3dc61e203f37 100644
--- a/arch/powerpc/kernel/cpu_setup_power.c
+++ b/arch/powerpc/kernel/cpu_setup_power.c
@@ -109,7 +109,7 @@ static void init_PMU_HV_ISA207(void)
  static void init_PMU(void)
  {
mtspr(SPRN_MMCRA, 0);
-   mtspr(SPRN_MMCR0, 0);
+   mtspr(SPRN_MMCR0, MMCR0_FC);


Sticky point here is, currently if not frozen, pmc5/6 will
keep counting. And not freezing them at boot is quite useful
sometimes, like say when running in a simulation where we could calculate
approximate CPIs for micro benchmarks without the perf subsystem.
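
For instance, a quick CPI estimate only works while PMC5/6 free-run (a
sketch; PMC5 counts instructions completed, PMC6 cycles, and with
MMCR0_FC set at boot the freeze would first have to be lifted):

	mtspr(SPRN_MMCR0, mfspr(SPRN_MMCR0) & ~MMCR0_FC);
	u64 i0 = mfspr(SPRN_PMC5), c0 = mfspr(SPRN_PMC6);
	/* ... run the micro benchmark ... */
	u64 insns  = mfspr(SPRN_PMC5) - i0;
	u64 cycles = mfspr(SPRN_PMC6) - c0;
	/* CPI ~= cycles / insns */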


mtspr(SPRN_MMCR1, 0);
mtspr(SPRN_MMCR2, 0);
  }
@@ -123,7 +123,7 @@ static void init_PMU_ISA31(void)
  {
mtspr(SPRN_MMCR3, 0);
mtspr(SPRN_MMCRA, MMCRA_BHRB_DISABLE);
-   mtspr(SPRN_MMCR0, MMCR0_PMCCEXT);
+   mtspr(SPRN_MMCR0, MMCR0_FC | MMCR0_PMCCEXT);
  }

  /*
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 0a6b36b4bda8..06a089fbeaa7 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -353,7 +353,7 @@ static void init_pmu_power8(void)
}

mtspr(SPRN_MMCRA, 0);
-   mtspr(SPRN_MMCR0, 0);
+   mtspr(SPRN_MMCR0, MMCR0_FC);
mtspr(SPRN_MMCR1, 0);
mtspr(SPRN_MMCR2, 0);
mtspr(SPRN_MMCRS, 0);
@@ -392,7 +392,7 @@ static void init_pmu_power9(void)
mtspr(SPRN_MMCRC, 0);

mtspr(SPRN_MMCRA, 0);
-   mtspr(SPRN_MMCR0, 0);
+   mtspr(SPRN_MMCR0, MMCR0_FC);
mtspr(SPRN_MMCR1, 0);
mtspr(SPRN_MMCR2, 0);
  }
@@ -428,7 +428,7 @@ static void init_pmu_power10(void)

mtspr(SPRN_MMCR3, 0);
mtspr(SPRN_MMCRA, MMCRA_BHRB_DISABLE);
-   mtspr(SPRN_MMCR0, MMCR0_PMCCEXT);
+   mtspr(SPRN_MMCR0, MMCR0_FC | MMCR0_PMCCEXT);
  }

  static int __init feat_enable_pmu_power10(struct dt_cpu_feature *f)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 1f30f98b09d1..f7349d150828 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2593,6 +2593,11 @@ static int kvmppc_core_vcpu_create_hv(struct kvm_vcpu 
*vcpu)
  #endif
  #endif
vcpu->arch.mmcr[0] = MMCR0_FC;
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   vcpu->arch.mmcr[0] |= MMCR0_PMCCEXT;
+   vcpu->arch.mmcra = MMCRA_BHRB_DISABLE;
+   }
+
vcpu->arch.ctrl = CTRL_RUNLATCH;
/* default to host PVR, since we can't spoof it */
kvmppc_set_pvr_hv(vcpu, mfspr(SPRN_PVR));
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 51622411a7cc..e33b29ec1a65 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1361,6 +1361,13 @@ static void power_pmu_enable(struct pmu *pmu)
goto out;

if (cpuhw->n_events == 0) {
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   mtspr(SPRN_MMCRA, MMCRA_BHRB_DISABLE);
+   mtspr(SPRN_MMCR0, MMCR0_FC | MMCR0_PMCCEXT);
+   } else {
+   mtspr(SPRN_MMCRA, 0);
+   mtspr(SPRN_MMCR0, MMCR0_FC);
+   }
ppc_set_pmu_inuse(0);
goto out;
}
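
To illustrate the point above: with PMC5 (instructions completed) and
PMC6 (cycles) left free-running rather than frozen by MMCR0_FC, a rough
CPI estimate needs nothing from the perf subsystem. A minimal sketch,
not from the patch; the helper name is made up:

	static void report_approx_cpi(void)
	{
		unsigned long insns  = mfspr(SPRN_PMC5);
		unsigned long cycles = mfspr(SPRN_PMC6);

		if (insns)
			pr_info("approx CPI: %lu.%02lu\n", cycles / insns,
				(cycles % insns) * 100 / insns);
	}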


[PATCH v2 01/32] powerpc/pseries/pci: Introduce __find_pe_total_msi()

2021-07-01 Thread Cédric Le Goater
It will help to size the PCI MSI domain.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/pseries/msi.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index 637300330507..d2d090e04745 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -164,12 +164,12 @@ static int check_req_msix(struct pci_dev *pdev, int nvec)
 
 /* Quota calculation */
 
-static struct device_node *find_pe_total_msi(struct pci_dev *dev, int *total)
+static struct device_node *__find_pe_total_msi(struct device_node *node, int *total)
 {
struct device_node *dn;
const __be32 *p;
 
-   dn = of_node_get(pci_device_to_OF_node(dev));
+   dn = of_node_get(node);
while (dn) {
p = of_get_property(dn, "ibm,pe-total-#msi", NULL);
if (p) {
@@ -185,6 +185,11 @@ static struct device_node *find_pe_total_msi(struct pci_dev *dev, int *total)
return NULL;
 }
 
+static struct device_node *find_pe_total_msi(struct pci_dev *dev, int *total)
+{
+   return __find_pe_total_msi(pci_device_to_OF_node(dev), total);
+}
+
 static struct device_node *find_pe_dn(struct pci_dev *dev, int *total)
 {
struct device_node *dn;
-- 
2.31.1
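
A hypothetical caller (not part of this patch) showing what the split
enables: sizing a per-PHB MSI domain straight from the PHB device node.
phb_msi_count() is an illustrative name:

	static int phb_msi_count(struct pci_controller *phb)
	{
		struct device_node *dn;
		int total = 0;

		dn = __find_pe_total_msi(phb->dn, &total);
		if (!dn)
			return -ENODEV;

		/* balance the of_node_get() done by __find_pe_total_msi() */
		of_node_put(dn);
		return total;
	}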



[PATCH v2 11/32] powerpc/powernv/pci: Introduce __pnv_pci_ioda_msi_setup()

2021-07-01 Thread Cédric Le Goater
It will be used as a 'compose_msg' handler of the MSI domain introduced
later.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 28 +++
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 7de464679292..2922674cc934 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2016,15 +2016,17 @@ bool is_pnv_opal_msi(struct irq_chip *chip)
 }
 EXPORT_SYMBOL_GPL(is_pnv_opal_msi);
 
-static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
- unsigned int hwirq, unsigned int virq,
- unsigned int is_64, struct msi_msg *msg)
+static int __pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
+   unsigned int xive_num,
+   unsigned int is_64, struct msi_msg *msg)
 {
struct pnv_ioda_pe *pe = pnv_ioda_get_pe(dev);
-   unsigned int xive_num = hwirq - phb->msi_base;
__be32 data;
int rc;
 
+   dev_dbg(&dev->dev, "%s: setup %s-bit MSI for vector #%d\n", __func__,
+   is_64 ? "64" : "32", xive_num);
+
/* No PE assigned ? bail out ... no MSI for you ! */
if (pe == NULL)
return -ENXIO;
@@ -2072,12 +2074,28 @@ static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, 
struct pci_dev *dev,
}
msg->data = be32_to_cpu(data);
 
+   return 0;
+}
+
+static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
+ unsigned int hwirq, unsigned int virq,
+ unsigned int is_64, struct msi_msg *msg)
+{
+   struct pnv_ioda_pe *pe = pnv_ioda_get_pe(dev);
+   unsigned int xive_num = hwirq - phb->msi_base;
+   int rc;
+
+   rc = __pnv_pci_ioda_msi_setup(phb, dev, xive_num, is_64, msg);
+   if (rc)
+   return rc;
+
+   /* P8 only */
pnv_set_msi_irq_chip(phb, virq);
 
pr_devel("%s: %s-bit MSI on hwirq %x (xive #%d),"
 " address=%x_%08x data=%x PE# %x\n",
 pci_name(dev), is_64 ? "64" : "32", hwirq, xive_num,
-msg->address_hi, msg->address_lo, data, pe->pe_number);
+msg->address_hi, msg->address_lo, msg->data, pe->pe_number);
 
return 0;
 }
-- 
2.31.1



[PATCH v2 12/32] powerpc/powernv/pci: Add MSI domains

2021-07-01 Thread Cédric Le Goater
This is very similar to the MSI domains of the pSeries platform. The
MSI allocator is directly handled under the Linux PHB in the
in-the-middle "PNV-MSI" domain.

Only the XIVE (P9/P10) parent domain is supported for now. Support for
XICS will come later.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 188 ++
 1 file changed, 188 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 2922674cc934..d2a17fcb6002 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -2100,6 +2101,189 @@ static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, 
struct pci_dev *dev,
return 0;
 }
 
+/*
+ * The msi_free() op is called before irq_domain_free_irqs_top() when
+ * the handler data is still available. Use that to clear the XIVE
+ * controller.
+ */
+static void pnv_msi_ops_msi_free(struct irq_domain *domain,
+struct msi_domain_info *info,
+unsigned int irq)
+{
+   if (xive_enabled())
+   xive_irq_free_data(irq);
+}
+
+static struct msi_domain_ops pnv_pci_msi_domain_ops = {
+   .msi_free   = pnv_msi_ops_msi_free,
+};
+
+static void pnv_msi_shutdown(struct irq_data *d)
+{
+   d = d->parent_data;
+   if (d->chip->irq_shutdown)
+   d->chip->irq_shutdown(d);
+}
+
+static void pnv_msi_mask(struct irq_data *d)
+{
+   pci_msi_mask_irq(d);
+   irq_chip_mask_parent(d);
+}
+
+static void pnv_msi_unmask(struct irq_data *d)
+{
+   pci_msi_unmask_irq(d);
+   irq_chip_unmask_parent(d);
+}
+
+static struct irq_chip pnv_pci_msi_irq_chip = {
+   .name   = "PNV-PCI-MSI",
+   .irq_shutdown   = pnv_msi_shutdown,
+   .irq_mask   = pnv_msi_mask,
+   .irq_unmask = pnv_msi_unmask,
+   .irq_eoi= irq_chip_eoi_parent,
+};
+
+static struct msi_domain_info pnv_msi_domain_info = {
+   .flags = (MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
+ MSI_FLAG_MULTI_PCI_MSI  | MSI_FLAG_PCI_MSIX),
+   .ops   = &pnv_pci_msi_domain_ops,
+   .chip  = &pnv_pci_msi_irq_chip,
+};
+
+static void pnv_msi_compose_msg(struct irq_data *d, struct msi_msg *msg)
+{
+   struct msi_desc *entry = irq_data_get_msi_desc(d);
+   struct pci_dev *pdev = msi_desc_to_pci_dev(entry);
+   struct pci_controller *hose = irq_data_get_irq_chip_data(d);
+   struct pnv_phb *phb = hose->private_data;
+   int rc;
+
+   rc = __pnv_pci_ioda_msi_setup(phb, pdev, d->hwirq,
+ entry->msi_attrib.is_64, msg);
+   if (rc)
+   dev_err(&pdev->dev, "Failed to setup %s-bit MSI #%ld : %d\n",
+   entry->msi_attrib.is_64 ? "64" : "32", d->hwirq, rc);
+}
+
+static struct irq_chip pnv_msi_irq_chip = {
+   .name   = "PNV-MSI",
+   .irq_shutdown   = pnv_msi_shutdown,
+   .irq_mask   = irq_chip_mask_parent,
+   .irq_unmask = irq_chip_unmask_parent,
+   .irq_eoi= irq_chip_eoi_parent,
+   .irq_set_affinity   = irq_chip_set_affinity_parent,
+   .irq_compose_msi_msg= pnv_msi_compose_msg,
+};
+
+static int pnv_irq_parent_domain_alloc(struct irq_domain *domain,
+  unsigned int virq, int hwirq)
+{
+   struct irq_fwspec parent_fwspec;
+   int ret;
+
+   parent_fwspec.fwnode = domain->parent->fwnode;
+   parent_fwspec.param_count = 2;
+   parent_fwspec.param[0] = hwirq;
+   parent_fwspec.param[1] = IRQ_TYPE_EDGE_RISING;
+
+   ret = irq_domain_alloc_irqs_parent(domain, virq, 1, &parent_fwspec);
+   if (ret)
+   return ret;
+
+   return 0;
+}
+
+static int pnv_irq_domain_alloc(struct irq_domain *domain, unsigned int virq,
+   unsigned int nr_irqs, void *arg)
+{
+   struct pci_controller *hose = domain->host_data;
+   struct pnv_phb *phb = hose->private_data;
+   msi_alloc_info_t *info = arg;
+   struct pci_dev *pdev = msi_desc_to_pci_dev(info->desc);
+   int hwirq;
+   int i, ret;
+
+   hwirq = msi_bitmap_alloc_hwirqs(&phb->msi_bmp, nr_irqs);
+   if (hwirq < 0) {
+   dev_warn(&pdev->dev, "failed to find a free MSI\n");
+   return -ENOSPC;
+   }
+
+   dev_dbg(&pdev->dev, "%s bridge %pOF %d/%x #%d\n", __func__,
+   hose->dn, virq, hwirq, nr_irqs);
+
+   for (i = 0; i < nr_irqs; i++) {
+   ret = pnv_irq_parent_domain_alloc(domain, virq + i,
+ phb->msi_base + hwirq + i);
+   if (ret)
+   goto out;
+
+   irq_domain_set_hwirq_and_chip(domain, virq + i, hwirq + i,
+				      &pnv_msi_irq_chip, hose);
+ 
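
The hunk above is truncated in the archive. By symmetry with the alloc
path, the matching free handler would return the hwirqs to the PHB
bitmap and release the parent interrupts -- a sketch, not the author's
verbatim code:

	static void pnv_irq_domain_free(struct irq_domain *domain,
					unsigned int virq, unsigned int nr_irqs)
	{
		struct irq_data *d = irq_domain_get_irq_data(domain, virq);
		struct pci_controller *hose = irq_data_get_irq_chip_data(d);
		struct pnv_phb *phb = hose->private_data;

		msi_bitmap_free_hwirqs(&phb->msi_bmp, d->hwirq, nr_irqs);
		irq_domain_free_irqs_parent(domain, virq, nr_irqs);
	}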

[PATCH v2 13/32] KVM: PPC: Book3S HV: Use the new IRQ chip to detect passthrough interrupts

2021-07-01 Thread Cédric Le Goater
Passthrough PCI MSI interrupts are detected in KVM with a check on a
specific EOI handler (P8) or on XIVE (P9). We can now check the
PCI-MSI IRQ chip which is cleaner.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/kvm/book3s_hv.c  | 2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index dd39b5373075..048b4ca55cfe 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5260,7 +5260,7 @@ static int kvmppc_set_passthru_irq(struct kvm *kvm, int 
host_irq, int guest_gsi)
 * what our real-mode EOI code does, or a XIVE interrupt
 */
chip = irq_data_get_irq_chip(&desc->irq_data);
-   if (!chip || !(is_pnv_opal_msi(chip) || is_xive_irq(chip))) {
+   if (!chip || !is_pnv_opal_msi(chip)) {
 		pr_warn("kvmppc_set_passthru_irq_hv: Could not assign IRQ map for (%d,%d)\n",
 			host_irq, guest_gsi);
mutex_unlock(&kvm->lock);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index d2a17fcb6002..e77caa4dbbdf 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2007,13 +2007,15 @@ void pnv_set_msi_irq_chip(struct pnv_phb *phb, unsigned 
int virq)
irq_set_chip(virq, &phb->ioda.irq_chip);
 }
 
+static struct irq_chip pnv_pci_msi_irq_chip;
+
 /*
  * Returns true iff chip is something that we could call
  * pnv_opal_pci_msi_eoi for.
  */
 bool is_pnv_opal_msi(struct irq_chip *chip)
 {
-   return chip->irq_eoi == pnv_ioda2_msi_eoi;
+   return chip->irq_eoi == pnv_ioda2_msi_eoi || chip == &pnv_pci_msi_irq_chip;
 }
 EXPORT_SYMBOL_GPL(is_pnv_opal_msi);
 
-- 
2.31.1



[PATCH v2 08/32] powerpc/pseries/pci: Add a domain_free_irqs() handler

2021-07-01 Thread Cédric Le Goater
The RTAS firmware can not disable one MSI at a time. It's all or
nothing. We need a custom free IRQ handler for that.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/pseries/msi.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index 86c6809ebac2..591cee9cbc9e 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -529,8 +529,24 @@ static int pseries_msi_ops_prepare(struct irq_domain 
*domain, struct device *dev
return rtas_prepare_msi_irqs(pdev, nvec, type, arg);
 }
 
+/*
+ * RTAS can not disable one MSI at a time. It's all or nothing. Do it
+ * at the end after all IRQs have been freed.
+ */
+static void pseries_msi_domain_free_irqs(struct irq_domain *domain,
+struct device *dev)
+{
+   if (WARN_ON_ONCE(!dev_is_pci(dev)))
+   return;
+
+   __msi_domain_free_irqs(domain, dev);
+
+   rtas_disable_msi(to_pci_dev(dev));
+}
+
 static struct msi_domain_ops pseries_pci_msi_domain_ops = {
.msi_prepare= pseries_msi_ops_prepare,
+   .domain_free_irqs = pseries_msi_domain_free_irqs,
 };
 
 static void pseries_msi_shutdown(struct irq_data *d)
-- 
2.31.1
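
For context, rtas_disable_msi() folds the "all or nothing" teardown
into a single firmware call -- roughly (a sketch, not the verbatim
helper from msi.c):

	static int rtas_disable_msi(struct pci_dev *pdev)
	{
		struct pci_dn *pdn = pci_get_pdn(pdev);

		if (!pdn)
			return -ENODEV;

		/* "changing" to zero MSIs disables them all at once */
		return rtas_change_msi(pdn, RTAS_CHANGE_FN, 0) ? -EIO : 0;
	}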



[PATCH v2 17/32] powerpc/xics: Rename the map handler in a check handler

2021-07-01 Thread Cédric Le Goater
This moves the IRQ initialization done under the different ICS backends
into the common part of XICS. The 'map' handler becomes a simple
'check' on the HW IRQ at the FW level.

As we don't need an ICS anymore in xics_migrate_irqs_away(), the XICS
domain does not set a chip data for the IRQ.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/include/asm/xics.h|  3 ++-
 arch/powerpc/sysdev/xics/ics-native.c  | 13 +---
 arch/powerpc/sysdev/xics/ics-opal.c| 27 +
 arch/powerpc/sysdev/xics/ics-rtas.c| 28 +-
 arch/powerpc/sysdev/xics/xics-common.c | 15 --
 5 files changed, 36 insertions(+), 50 deletions(-)

diff --git a/arch/powerpc/include/asm/xics.h b/arch/powerpc/include/asm/xics.h
index 584dcf903590..e76d835dc03f 100644
--- a/arch/powerpc/include/asm/xics.h
+++ b/arch/powerpc/include/asm/xics.h
@@ -89,10 +89,11 @@ static inline int ics_opal_init(void) { return -ENODEV; }
 /* ICS instance, hooked up to chip_data of an irq */
 struct ics {
struct list_head link;
-   int (*map)(struct ics *ics, unsigned int virq);
+   int (*check)(struct ics *ics, unsigned int hwirq);
void (*mask_unknown)(struct ics *ics, unsigned long vec);
long (*get_server)(struct ics *ics, unsigned long vec);
int (*host_match)(struct ics *ics, struct device_node *node);
+   struct irq_chip *chip;
char data[];
 };
 
diff --git a/arch/powerpc/sysdev/xics/ics-native.c 
b/arch/powerpc/sysdev/xics/ics-native.c
index d450502f4053..dec7d93a8ba1 100644
--- a/arch/powerpc/sysdev/xics/ics-native.c
+++ b/arch/powerpc/sysdev/xics/ics-native.c
@@ -131,19 +131,15 @@ static struct irq_chip ics_native_irq_chip = {
.irq_retrigger  = xics_retrigger,
 };
 
-static int ics_native_map(struct ics *ics, unsigned int virq)
+static int ics_native_check(struct ics *ics, unsigned int hw_irq)
 {
-   unsigned int vec = (unsigned int)virq_to_hw(virq);
struct ics_native *in = to_ics_native(ics);
 
-   pr_devel("%s: vec=0x%x\n", __func__, vec);
+   pr_devel("%s: hw_irq=0x%x\n", __func__, hw_irq);
 
-   if (vec < in->ibase || vec >= (in->ibase + in->icount))
+   if (hw_irq < in->ibase || hw_irq >= (in->ibase + in->icount))
return -EINVAL;
 
-   irq_set_chip_and_handler(virq, &ics_native_irq_chip, handle_fasteoi_irq);
-   irq_set_chip_data(virq, ics);
-
return 0;
 }
 
@@ -177,10 +173,11 @@ static int ics_native_host_match(struct ics *ics, struct 
device_node *node)
 }
 
 static struct ics ics_native_template = {
-   .map= ics_native_map,
+   .check  = ics_native_check,
.mask_unknown   = ics_native_mask_unknown,
.get_server = ics_native_get_server,
.host_match = ics_native_host_match,
+   .chip = &ics_native_irq_chip,
 };
 
 static int __init ics_native_add_one(struct device_node *np)
diff --git a/arch/powerpc/sysdev/xics/ics-opal.c 
b/arch/powerpc/sysdev/xics/ics-opal.c
index 823f6c9664cd..8c7ddcc718b6 100644
--- a/arch/powerpc/sysdev/xics/ics-opal.c
+++ b/arch/powerpc/sysdev/xics/ics-opal.c
@@ -157,26 +157,13 @@ static struct irq_chip ics_opal_irq_chip = {
.irq_retrigger = xics_retrigger,
 };
 
-static int ics_opal_map(struct ics *ics, unsigned int virq);
-static void ics_opal_mask_unknown(struct ics *ics, unsigned long vec);
-static long ics_opal_get_server(struct ics *ics, unsigned long vec);
-
 static int ics_opal_host_match(struct ics *ics, struct device_node *node)
 {
return 1;
 }
 
-/* Only one global & state struct ics */
-static struct ics ics_hal = {
-   .map= ics_opal_map,
-   .mask_unknown   = ics_opal_mask_unknown,
-   .get_server = ics_opal_get_server,
-   .host_match = ics_opal_host_match,
-};
-
-static int ics_opal_map(struct ics *ics, unsigned int virq)
+static int ics_opal_check(struct ics *ics, unsigned int hw_irq)
 {
-   unsigned int hw_irq = (unsigned int)virq_to_hw(virq);
int64_t rc;
__be16 server;
int8_t priority;
@@ -189,9 +176,6 @@ static int ics_opal_map(struct ics *ics, unsigned int virq)
if (rc != OPAL_SUCCESS)
return -ENXIO;
 
-   irq_set_chip_and_handler(virq, &ics_opal_irq_chip, handle_fasteoi_irq);
-   irq_set_chip_data(virq, &ics_hal);
-
return 0;
 }
 
@@ -222,6 +206,15 @@ static long ics_opal_get_server(struct ics *ics, unsigned 
long vec)
return ics_opal_unmangle_server(be16_to_cpu(server));
 }
 
+/* Only one global & state struct ics */
+static struct ics ics_hal = {
+   .check  = ics_opal_check,
+   .mask_unknown   = ics_opal_mask_unknown,
+   .get_server = ics_opal_get_server,
+   .host_match = ics_opal_host_match,
+   .chip   = &ics_opal_irq_chip,
+};
+
 int __init ics_opal_init(void)
 {
if (!firmware_has_feature(FW_FEATURE_OPAL))
diff --git a/arch/powerpc/sysdev/xics/ics-rtas.c 
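
The ics-rtas.c hunk is cut off in the archive. By symmetry with the
ics-native and ics-opal changes above, it would rename the 'map'
handler to a 'check' and publish the chip -- a sketch, not the verbatim
hunk:

	static struct ics ics_rtas = {
		.check		= ics_rtas_check,
		.mask_unknown	= ics_rtas_mask_unknown,
		.get_server	= ics_rtas_get_server,
		.host_match	= ics_rtas_host_match,
		.chip		= &ics_rtas_irq_chip,
	};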

[PATCH v2 20/32] powerpc/xics: Add support for IRQ domain hierarchy

2021-07-01 Thread Cédric Le Goater
XICS doesn't have any state associated with the IRQ. The support is
straightforward and simpler than for XIVE.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xics/xics-common.c | 41 ++
 1 file changed, 41 insertions(+)

diff --git a/arch/powerpc/sysdev/xics/xics-common.c 
b/arch/powerpc/sysdev/xics/xics-common.c
index 419d91bffec3..e82d0d4ddec0 100644
--- a/arch/powerpc/sysdev/xics/xics-common.c
+++ b/arch/powerpc/sysdev/xics/xics-common.c
@@ -406,7 +406,48 @@ int xics_retrigger(struct irq_data *data)
return 0;
 }
 
+#ifdef CONFIG_IRQ_DOMAIN_HIERARCHY
+static int xics_host_domain_translate(struct irq_domain *d, struct irq_fwspec 
*fwspec,
+ unsigned long *hwirq, unsigned int *type)
+{
+   return xics_host_xlate(d, to_of_node(fwspec->fwnode), fwspec->param,
+  fwspec->param_count, hwirq, type);
+}
+
+static int xics_host_domain_alloc(struct irq_domain *domain, unsigned int virq,
+ unsigned int nr_irqs, void *arg)
+{
+   struct irq_fwspec *fwspec = arg;
+   irq_hw_number_t hwirq;
+   unsigned int type = IRQ_TYPE_NONE;
+   int i, rc;
+
+   rc = xics_host_domain_translate(domain, fwspec, &hwirq, &type);
+   if (rc)
+   return rc;
+
+   pr_debug("%s %d/%lx #%d\n", __func__, virq, hwirq, nr_irqs);
+
+   for (i = 0; i < nr_irqs; i++)
+   irq_domain_set_info(domain, virq + i, hwirq + i, xics_ics->chip,
+   xics_ics, handle_fasteoi_irq, NULL, NULL);
+
+   return 0;
+}
+
+static void xics_host_domain_free(struct irq_domain *domain,
+ unsigned int virq, unsigned int nr_irqs)
+{
+   pr_debug("%s %d #%d\n", __func__, virq, nr_irqs);
+}
+#endif
+
 static const struct irq_domain_ops xics_host_ops = {
+#ifdef CONFIG_IRQ_DOMAIN_HIERARCHY
+   .alloc  = xics_host_domain_alloc,
+   .free   = xics_host_domain_free,
+   .translate = xics_host_domain_translate,
+#endif
.match = xics_host_match,
.map = xics_host_map,
.xlate = xics_host_xlate,
-- 
2.31.1
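
A hypothetical user of the new hierarchy: build an irq_fwspec naming a
XICS hwirq and allocate through the domain. xics_alloc_one() is an
illustrative name, not from the patch:

	static int xics_alloc_one(struct irq_domain *domain, u32 hwirq)
	{
		struct irq_fwspec fwspec = {
			.fwnode      = domain->fwnode,
			.param_count = 2,
			.param       = { hwirq, IRQ_TYPE_EDGE_RISING },
		};

		return irq_domain_alloc_irqs(domain, 1, NUMA_NO_NODE, &fwspec);
	}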



[PATCH v2 19/32] powerpc/xics: Add debug logging to the set_irq_affinity handlers

2021-07-01 Thread Cédric Le Goater
It really helps to know how the HW is configured when tweaking the IRQ
subsystem.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xics/ics-opal.c | 2 +-
 arch/powerpc/sysdev/xics/ics-rtas.c | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/xics/ics-opal.c 
b/arch/powerpc/sysdev/xics/ics-opal.c
index 8c7ddcc718b6..bf26cae1b982 100644
--- a/arch/powerpc/sysdev/xics/ics-opal.c
+++ b/arch/powerpc/sysdev/xics/ics-opal.c
@@ -133,7 +133,7 @@ static int ics_opal_set_affinity(struct irq_data *d,
}
server = ics_opal_mangle_server(wanted_server);
 
-   pr_devel("ics-hal: set-affinity irq %d [hw 0x%x] server: 0x%x/0x%x\n",
+   pr_debug("ics-hal: set-affinity irq %d [hw 0x%x] server: 0x%x/0x%x\n",
 d->irq, hw_irq, wanted_server, server);
 
rc = opal_set_xive(hw_irq, server, priority);
diff --git a/arch/powerpc/sysdev/xics/ics-rtas.c 
b/arch/powerpc/sysdev/xics/ics-rtas.c
index 6d19d711ed35..b50c6341682e 100644
--- a/arch/powerpc/sysdev/xics/ics-rtas.c
+++ b/arch/powerpc/sysdev/xics/ics-rtas.c
@@ -133,6 +133,9 @@ static int ics_rtas_set_affinity(struct irq_data *d,
return -1;
}
 
+   pr_debug("%s: irq %d [hw 0x%x] server: 0x%x\n", __func__, d->irq,
+hw_irq, irq_server);
+
status = rtas_call_reentrant(ibm_set_xive, 3, 1, NULL,
 hw_irq, irq_server, xics_status[1]);
 
-- 
2.31.1



Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes

2021-07-01 Thread Valentin Schneider
On 01/07/21 09:45, Srikar Dronamraju wrote:
> @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
>  void sched_domains_numa_masks_set(unsigned int cpu)
>  {
>   int node = cpu_to_node(cpu);
> - int i, j;
> + int i, j, empty;
>
> + empty = cpumask_empty(sched_domains_numa_masks[0][node]);
>   for (i = 0; i < sched_domains_numa_levels; i++) {
>   for (j = 0; j < nr_node_ids; j++) {
> - if (node_distance(j, node) <= sched_domains_numa_distance[i])
> + if (!node_online(j))
> + continue;
> +
> + if (node_distance(j, node) <= sched_domains_numa_distance[i]) {
>   cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
> +
> + /*
> +  * We skip updating numa_masks for offline
> +  * nodes. However now that the node is
> +  * finally online, CPUs that were added
> +  * earlier should now be accommodated into
> +  * the newly online node's numa mask.
> +  */
> + if (node != j && empty) {
> + cpumask_or(sched_domains_numa_masks[i][node],
> +sched_domains_numa_masks[i][node],
> +sched_domains_numa_masks[0][j]);
> + }
> + }

Hmph, so we're playing games with masks of offline nodes - is that really
necessary? Your modification of sched_init_numa() still scans all of the
nodes (regardless of their online status) to build the distance map, and
that is never updated (sched_init_numa() is pretty much an __init
function).

So AFAICT this is all to cope with topology_span_sane() not applying
'cpu_map' to its masks. That seemed fine to me back when I wrote it, but in
light of having bogus distance values for offline nodes, not so much...

What about the below instead?

---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b77ad49dc14f..c2d9caad4aa6 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2075,6 +2075,7 @@ static struct sched_domain *build_sched_domain(struct 
sched_domain_topology_leve
 static bool topology_span_sane(struct sched_domain_topology_level *tl,
  const struct cpumask *cpu_map, int cpu)
 {
+   struct cpumask *intersect = sched_domains_tmpmask;
int i;
 
/* NUMA levels are allowed to overlap */
@@ -2090,14 +2091,17 @@ static bool topology_span_sane(struct 
sched_domain_topology_level *tl,
for_each_cpu(i, cpu_map) {
if (i == cpu)
continue;
+
/*
-* We should 'and' all those masks with 'cpu_map' to exactly
-* match the topology we're about to build, but that can only
-* remove CPUs, which only lessens our ability to detect
-* overlaps
+* We shouldn't have to bother with cpu_map here, unfortunately
+* some architectures (powerpc says hello) have to deal with
+* offline NUMA nodes reporting bogus distance values. This can
+* lead to funky NODE domain spans, but since those are offline
+* we can mask them out.
 */
+   cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
-   cpumask_intersects(tl->mask(cpu), tl->mask(i)))
+   cpumask_intersects(intersect, cpu_map))
return false;
}
 


[PATCH v2 07/32] powerpc/xive: Remove irqd_is_started() check when setting the affinity

2021-07-01 Thread Cédric Le Goater
In the early days of XIVE support, commit cffb717ceb8e ("powerpc/xive:
Ensure active irqd when setting affinity") tried to fix an issue
related to interrupt migration. If the root cause was related to CPU
unplug, it should have been fixed and there is no reason to keep the
irqd_is_started() check. This test also breaks affinity setting
of MSIs, which can be set before the associated IRQ is started.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index a03057bfccfd..38183c9b21c0 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -719,10 +719,6 @@ static int xive_irq_set_affinity(struct irq_data *d,
if (cpumask_any_and(cpumask, cpu_online_mask) >= nr_cpu_ids)
return -EINVAL;
 
-   /* Don't do anything if the interrupt isn't started */
-   if (!irqd_is_started(d))
-   return IRQ_SET_MASK_OK;
-
/*
 * If existing target is already in the new mask, and is
 * online then do nothing.
-- 
2.31.1



[PATCH] powerpc/mm: Fix lockup on kernel exec fault

2021-07-01 Thread Christophe Leroy
The powerpc kernel is not prepared to handle exec faults from kernel.
Especially, the function is_exec_fault() will return 'false' when an
exec fault is taken by kernel, because the check is based on reading
current->thread.regs->trap which contains the trap from user.

For instance, when provoking a LKDTM EXEC_USERSPACE test,
current->thread.regs->trap is set to SYSCALL trap (0xc00), and
the fault taken by the kernel is not seen as an exec fault by
set_access_flags_filter().

Commit d7df2443cd5f ("powerpc/mm: Fix spurrious segfaults on radix
with autonuma") made it clear and handled it properly. But later on
commit d3ca587404b3 ("powerpc/mm: Fix reporting of kernel execute
faults") removed that handling, introducing test based on error_code.
And here is the problem, because on the 603 all upper bits of SRR1
get cleared when the TLB instruction miss handler bails out to ISI.

Until commit cbd7e6ca0210 ("powerpc/fault: Avoid heavy
search_exception_tables() verification"), an exec fault from kernel
at a userspace address was indirectly caught by the lack of entry for
that address in the exception tables. But after that commit the
kernel mainly relies on KUAP or on core mm handling to catch wrong
user accesses. Here the access is not wrong, so mm handles it.
It is a minor fault because PAGE_EXEC is not set,
set_access_flags_filter() should set PAGE_EXEC and voila.
But as is_exec_fault() returns false as explained in the beginning,
set_access_flags_filter() bails out without setting PAGE_EXEC flag,
which leads to a forever minor exec fault.

As the kernel is not prepared to handle such exec faults, the thing
to do is to fire in bad_kernel_fault() for any exec fault taken by
the kernel, as it was prior to commit d3ca587404b3.

Fixes: d3ca587404b3 ("powerpc/mm: Fix reporting of kernel execute faults")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/fault.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 34f641d4a2fe..a8d0ce85d39a 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -199,9 +199,7 @@ static bool bad_kernel_fault(struct pt_regs *regs, unsigned 
long error_code,
 {
int is_exec = TRAP(regs) == INTERRUPT_INST_STORAGE;
 
-   /* NX faults set DSISR_PROTFAULT on the 8xx, DSISR_NOEXEC_OR_G on 
others */
-   if (is_exec && (error_code & (DSISR_NOEXEC_OR_G | DSISR_KEYFAULT |
- DSISR_PROTFAULT))) {
+   if (is_exec) {
 		pr_crit_ratelimited("kernel tried to execute %s page (%lx) - exploit attempt? (uid: %d)\n",
 				    address >= TASK_SIZE ? "exec-protected" : "user",
address,
-- 
2.25.0



[PATCH] powerpc/xmon: Use ARRAY_SIZE

2021-07-01 Thread Jason Wang
The ARRAY_SIZE macro is a more compact and more formal way to get an
array's size in the Linux kernel source. In addition, it is more
readable for kernel developers. Thus, we can replace all
sizeof(arr)/sizeof(arr[0]) uses with ARRAY_SIZE.

Signed-off-by: Jason Wang 
---
 arch/powerpc/xmon/ppc-opc.c | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/xmon/ppc-opc.c b/arch/powerpc/xmon/ppc-opc.c
index dfb80810b16c..e1d292fe6c6e 100644
--- a/arch/powerpc/xmon/ppc-opc.c
+++ b/arch/powerpc/xmon/ppc-opc.c
@@ -954,8 +954,7 @@ const struct powerpc_operand powerpc_operands[] =
   { 0xff, 11, NULL, NULL, PPC_OPERAND_SIGNOPT },
 };
 
-const unsigned int num_powerpc_operands = (sizeof (powerpc_operands)
-  / sizeof (powerpc_operands[0]));
+const unsigned int num_powerpc_operands = ARRAY_SIZE(powerpc_operands);
 
 /* The functions used to insert and extract complicated operands.  */
 
@@ -6968,9 +6967,8 @@ const struct powerpc_opcode powerpc_opcodes[] = {
 {"fcfidu.",XRC(63,974,1),  XRA_MASK, POWER7|PPCA2, PPCVLE, {FRT, 
FRB}},
 };
 
-const int powerpc_num_opcodes =
-  sizeof (powerpc_opcodes) / sizeof (powerpc_opcodes[0]);
-
+const int powerpc_num_opcodes = ARRAY_SIZE(powerpc_opcodes);
+
 /* The VLE opcode table.
 
The format of this opcode table is the same as the main opcode table.  */
@@ -7207,9 +7205,8 @@ const struct powerpc_opcode vle_opcodes[] = {
 {"se_bl",  BD8(58,0,1),BD8_MASK,   PPCVLE, 0,  {B8}},
 };
 
-const int vle_num_opcodes =
-  sizeof (vle_opcodes) / sizeof (vle_opcodes[0]);
-
+const int vle_num_opcodes = ARRAY_SIZE(vle_opcodes);
+
 /* The macro table.  This is only used by the assembler.  */
 
 /* The expressions of the form (-x ! 31) & (x | 31) have the value 0
@@ -7276,5 +7273,4 @@ const struct powerpc_macro powerpc_macros[] = {
 {"e_clrlslwi",4, PPCVLE, "e_rlwinm %0,%1,%3,(%2)-(%3),31-(%3)"},
 };
 
-const int powerpc_num_macros =
-  sizeof (powerpc_macros) / sizeof (powerpc_macros[0]);
+const int powerpc_num_macros = ARRAY_SIZE(powerpc_macros);
-- 
2.31.1
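
For reference, the kernel's ARRAY_SIZE (include/linux/kernel.h) also
type-checks its argument, so passing a plain pointer fails to build:

	#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + __must_be_array(arr))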





[PATCH] sched: Use WARN_ON

2021-07-01 Thread Jason Wang
The BUG_ON macro simplifies an if condition followed by BUG(), but it
will crash the kernel. Therefore, we can try using WARN_ON instead of
an if condition followed by BUG().

Signed-off-by: Jason Wang 
---
 arch/powerpc/platforms/cell/spufs/sched.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c 
b/arch/powerpc/platforms/cell/spufs/sched.c
index 369206489895..0f218d9e5733 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -904,8 +904,8 @@ static noinline void spusched_tick(struct spu_context *ctx)
struct spu_context *new = NULL;
struct spu *spu = NULL;
 
-   if (spu_acquire(ctx))
-   BUG();  /* a kernel thread never has signals pending */
+   /* a kernel thread never has signals pending */
+   WARN_ON(spu_acquire(ctx));
 
if (ctx->state != SPU_STATE_RUNNABLE)
goto out;
-- 
2.32.0
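
WARN_ON() evaluates to its condition, so the caller can warn and still
bail out, addressing the fallback concern raised in review. A sketch
with a made-up name and a hypothetical -EINTR fallback:

	static int spusched_tick_checked(struct spu_context *ctx)
	{
		if (WARN_ON(spu_acquire(ctx)))
			return -EINTR;	/* skip the tick, don't run unlocked */

		/* ... scheduling work with ctx held ... */

		spu_release(ctx);
		return 0;
	}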





[PATCH v2 26/32] powerpc/powernv/pci: Adapt is_pnv_opal_msi() to detect passthrough interrupt

2021-07-01 Thread Cédric Le Goater
The pnv_ioda2_msi_eoi() chip handler is not used anymore for MSIs.
Simply use the check on the PSI-MSI chip.

Cc: Alexey Kardashevskiy 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index eb38ce1fd434..6c4b37598bcc 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2015,7 +2015,7 @@ static struct irq_chip pnv_pci_msi_irq_chip;
  */
 bool is_pnv_opal_msi(struct irq_chip *chip)
 {
-   return chip->irq_eoi == pnv_ioda2_msi_eoi || chip == &pnv_pci_msi_irq_chip;
+   return chip == &pnv_pci_msi_irq_chip;
 }
 EXPORT_SYMBOL_GPL(is_pnv_opal_msi);
 
-- 
2.31.1



[PATCH v2 14/32] KVM: PPC: Book3S HV: XIVE: Change interface of passthrough interrupt routines

2021-07-01 Thread Cédric Le Goater
The routine kvmppc_set_passthru_irq() calls kvmppc_xive_set_mapped()
and kvmppc_xive_clr_mapped() with an IRQ descriptor. Use directly the
host IRQ number to remove a useless conversion.

Add some debug.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/include/asm/kvm_ppc.h |  4 ++--
 arch/powerpc/kvm/book3s_hv.c   |  4 ++--
 arch/powerpc/kvm/book3s_xive.c | 17 -
 3 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 2d88944f9f34..671fbd1a765e 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -664,9 +664,9 @@ extern int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
struct kvm_vcpu *vcpu, u32 cpu);
 extern void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu);
 extern int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long guest_irq,
- struct irq_desc *host_desc);
+ unsigned long host_irq);
 extern int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long guest_irq,
- struct irq_desc *host_desc);
+ unsigned long host_irq);
 extern u64 kvmppc_xive_get_icp(struct kvm_vcpu *vcpu);
 extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval);
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 048b4ca55cfe..965178aeff13 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5303,7 +5303,7 @@ static int kvmppc_set_passthru_irq(struct kvm *kvm, int 
host_irq, int guest_gsi)
pimap->n_mapped++;
 
if (xics_on_xive())
-   rc = kvmppc_xive_set_mapped(kvm, guest_gsi, desc);
+   rc = kvmppc_xive_set_mapped(kvm, guest_gsi, host_irq);
else
kvmppc_xics_set_mapped(kvm, guest_gsi, desc->irq_data.hwirq);
if (rc)
@@ -5344,7 +5344,7 @@ static int kvmppc_clr_passthru_irq(struct kvm *kvm, int 
host_irq, int guest_gsi)
}
 
if (xics_on_xive())
-   rc = kvmppc_xive_clr_mapped(kvm, guest_gsi, 
pimap->mapped[i].desc);
+   rc = kvmppc_xive_clr_mapped(kvm, guest_gsi, host_irq);
else
kvmppc_xics_clr_mapped(kvm, guest_gsi, 
pimap->mapped[i].r_hwirq);
 
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 9268d386b128..434da541a20b 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -921,13 +921,12 @@ int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval)
 }
 
 int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long guest_irq,
-  struct irq_desc *host_desc)
+  unsigned long host_irq)
 {
struct kvmppc_xive *xive = kvm->arch.xive;
struct kvmppc_xive_src_block *sb;
struct kvmppc_xive_irq_state *state;
-   struct irq_data *host_data = irq_desc_get_irq_data(host_desc);
-   unsigned int host_irq = irq_desc_get_irq(host_desc);
+   struct irq_data *host_data = irq_get_irq_data(host_irq);
unsigned int hw_irq = (unsigned int)irqd_to_hwirq(host_data);
u16 idx;
u8 prio;
@@ -936,7 +935,8 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long 
guest_irq,
if (!xive)
return -ENODEV;
 
-   pr_devel("set_mapped girq 0x%lx host HW irq 0x%x...\n",guest_irq, 
hw_irq);
+   pr_debug("%s: GIRQ 0x%lx host IRQ %ld XIVE HW IRQ 0x%x\n",
+__func__, guest_irq, host_irq, hw_irq);
 
sb = kvmppc_xive_find_source(xive, guest_irq, &idx);
if (!sb)
@@ -958,7 +958,7 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long 
guest_irq,
 */
rc = irq_set_vcpu_affinity(host_irq, state);
if (rc) {
-   pr_err("Failed to set VCPU affinity for irq %d\n", host_irq);
+   pr_err("Failed to set VCPU affinity for host IRQ %ld\n", 
host_irq);
return rc;
}
 
@@ -1018,12 +1018,11 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned 
long guest_irq,
 EXPORT_SYMBOL_GPL(kvmppc_xive_set_mapped);
 
 int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long guest_irq,
-  struct irq_desc *host_desc)
+  unsigned long host_irq)
 {
struct kvmppc_xive *xive = kvm->arch.xive;
struct kvmppc_xive_src_block *sb;
struct kvmppc_xive_irq_state *state;
-   unsigned int host_irq = irq_desc_get_irq(host_desc);
u16 idx;
u8 prio;
int rc;
@@ -1031,7 +1030,7 @@ int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long 
guest_irq,
if (!xive)
return -ENODEV;
 
-   pr_devel("clr_mapped girq 0x%lx...\n", guest_irq);
+   pr_debug("%s: GIRQ 0x%lx host IRQ %ld\n", __func__, guest_irq, 
host_irq);
 
sb = kvmppc_xive_find_source(xive, guest_irq, &idx);

[PATCH v2 23/32] powerpc/xics: Drop unmask of MSIs at startup

2021-07-01 Thread Cédric Le Goater
That was a workaround in the XICS domain because of the lack of MSI
domain. This is now handled.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xics/ics-opal.c | 11 ---
 arch/powerpc/sysdev/xics/ics-rtas.c |  9 -
 2 files changed, 20 deletions(-)

diff --git a/arch/powerpc/sysdev/xics/ics-opal.c 
b/arch/powerpc/sysdev/xics/ics-opal.c
index bf26cae1b982..c4d95d8beb6f 100644
--- a/arch/powerpc/sysdev/xics/ics-opal.c
+++ b/arch/powerpc/sysdev/xics/ics-opal.c
@@ -62,17 +62,6 @@ static void ics_opal_unmask_irq(struct irq_data *d)
 
 static unsigned int ics_opal_startup(struct irq_data *d)
 {
-#ifdef CONFIG_PCI_MSI
-   /*
-* The generic MSI code returns with the interrupt disabled on the
-* card, using the MSI mask bits. Firmware doesn't appear to unmask
-* at that level, so we do it here by hand.
-*/
-   if (irq_data_get_msi_desc(d))
-   pci_msi_unmask_irq(d);
-#endif
-
-   /* unmask it */
ics_opal_unmask_irq(d);
return 0;
 }
diff --git a/arch/powerpc/sysdev/xics/ics-rtas.c 
b/arch/powerpc/sysdev/xics/ics-rtas.c
index b50c6341682e..b9da317b7a2d 100644
--- a/arch/powerpc/sysdev/xics/ics-rtas.c
+++ b/arch/powerpc/sysdev/xics/ics-rtas.c
@@ -57,15 +57,6 @@ static void ics_rtas_unmask_irq(struct irq_data *d)
 
 static unsigned int ics_rtas_startup(struct irq_data *d)
 {
-#ifdef CONFIG_PCI_MSI
-   /*
-* The generic MSI code returns with the interrupt disabled on the
-* card, using the MSI mask bits. Firmware doesn't appear to unmask
-* at that level, so we do it here by hand.
-*/
-   if (irq_data_get_msi_desc(d))
-   pci_msi_unmask_irq(d);
-#endif
/* unmask it */
ics_rtas_unmask_irq(d);
return 0;
-- 
2.31.1



[PATCH v2 31/32] powerpc/xive: Use XIVE domain under xmon and debugfs

2021-07-01 Thread Cédric Le Goater
The default domain of the PCI/MSIs is not the XIVE domain anymore. To
list the IRQ mappings under XMON and debugfs, query the IRQ data from
the low level XIVE domain.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index f0012d6b4fe9..f8ff558bc305 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -322,11 +322,10 @@ void xmon_xive_get_irq_all(void)
struct irq_desc *desc;
 
for_each_irq_desc(i, desc) {
-   struct irq_data *d = irq_desc_get_irq_data(desc);
-   unsigned int hwirq = (unsigned int)irqd_to_hwirq(d);
+   struct irq_data *d = irq_domain_get_irq_data(xive_irq_domain, i);
 
-   if (d->domain == xive_irq_domain)
-   xmon_xive_get_irq_config(hwirq, d);
+   if (d)
+   xmon_xive_get_irq_config(irqd_to_hwirq(d), d);
}
 }
 
@@ -1766,9 +1765,9 @@ static int xive_core_debug_show(struct seq_file *m, void 
*private)
xive_debug_show_cpu(m, cpu);
 
for_each_irq_desc(i, desc) {
-   struct irq_data *d = irq_desc_get_irq_data(desc);
+   struct irq_data *d = irq_domain_get_irq_data(xive_irq_domain, i);
 
-   if (d->domain == xive_irq_domain)
+   if (d)
xive_debug_show_irq(m, d);
}
return 0;
-- 
2.31.1



[PATCH v2 30/32] KVM: PPC: Book3S HV: XICS: Fix mapping of passthrough interrupts

2021-07-01 Thread Cédric Le Goater
PCI MSIs now live in an MSI domain but the underlying calls, which
will EOI the interrupt in real mode, need an HW IRQ number mapped in
the XICS IRQ domain. Grab it there.

Cc: Alexey Kardashevskiy 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/kvm/book3s_hv.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 965178aeff13..1afbe91c6ca1 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5233,6 +5233,7 @@ static int kvmppc_set_passthru_irq(struct kvm *kvm, int 
host_irq, int guest_gsi)
struct kvmppc_passthru_irqmap *pimap;
struct irq_chip *chip;
int i, rc = 0;
+   struct irq_data *host_data;
 
if (!kvm_irq_bypass)
return 1;
@@ -5297,7 +5298,14 @@ static int kvmppc_set_passthru_irq(struct kvm *kvm, int 
host_irq, int guest_gsi)
 * the KVM real mode handler.
 */
smp_wmb();
-   irq_map->r_hwirq = desc->irq_data.hwirq;
+
+   /*
+* The 'host_irq' number is mapped in the PCI-MSI domain but
+* the underlying calls, which will EOI the interrupt in real
+* mode, need an HW IRQ number mapped in the XICS IRQ domain.
+*/
+   host_data = irq_domain_get_irq_data(irq_get_default_host(), host_irq);
+   irq_map->r_hwirq = (unsigned int)irqd_to_hwirq(host_data);
 
if (i == pimap->n_mapped)
pimap->n_mapped++;
@@ -5305,7 +5313,7 @@ static int kvmppc_set_passthru_irq(struct kvm *kvm, int 
host_irq, int guest_gsi)
if (xics_on_xive())
rc = kvmppc_xive_set_mapped(kvm, guest_gsi, host_irq);
else
-   kvmppc_xics_set_mapped(kvm, guest_gsi, desc->irq_data.hwirq);
+   kvmppc_xics_set_mapped(kvm, guest_gsi, irq_map->r_hwirq);
if (rc)
irq_map->r_hwirq = 0;
 
-- 
2.31.1



[PATCH v2 22/32] powerpc/pci: Drop XIVE restriction on MSI domains

2021-07-01 Thread Cédric Le Goater
The PowerNV and pSeries platforms now have support for both the XICS
and XIVE IRQ domains.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 4 +---
 arch/powerpc/platforms/pseries/msi.c  | 4 
 2 files changed, 1 insertion(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index b498876a976f..e2454439e574 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2332,9 +2332,7 @@ static void pnv_pci_init_ioda_msis(struct pnv_phb *phb)
pr_info("  Allocated bitmap for %d MSIs (base IRQ 0x%x)\n",
count, phb->msi_base);
 
-   /* Only supported by the XIVE driver */
-   if (xive_enabled())
-   pnv_msi_allocate_domains(phb->hose, count);
+   pnv_msi_allocate_domains(phb->hose, count);
 }
 
 static void pnv_ioda_setup_pe_res(struct pnv_ioda_pe *pe,
diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index e2127a3f7ebd..e196cc1b8540 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -720,10 +720,6 @@ int pseries_msi_allocate_domains(struct pci_controller 
*phb)
 {
int count;
 
-   /* Only supported by the XIVE driver */
-   if (!xive_enabled())
-   return -ENODEV;
-
if (!__find_pe_total_msi(phb->dn, &count)) {
pr_err("PCI: failed to find MSIs for bridge %pOF (domain %d)\n",
   phb->dn, phb->global_number);
-- 
2.31.1



[PATCH v2 16/32] powerpc/xics: Remove ICS list

2021-07-01 Thread Cédric Le Goater
We always had only one ICS per machine. Simplify the XICS driver by
removing the ICS list.

The ICS stored in the chip data of the XICS domain becomes useless and
we don't need it anymore to migrate away IRQs from a CPU. This will be
removed in a subsequent patch.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xics/xics-common.c | 45 +++---
 1 file changed, 19 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/sysdev/xics/xics-common.c 
b/arch/powerpc/sysdev/xics/xics-common.c
index 7c561a612366..05e5e7d84ca7 100644
--- a/arch/powerpc/sysdev/xics/xics-common.c
+++ b/arch/powerpc/sysdev/xics/xics-common.c
@@ -38,7 +38,7 @@ DEFINE_PER_CPU(struct xics_cppr, xics_cppr);
 
 struct irq_domain *xics_host;
 
-static LIST_HEAD(ics_list);
+static struct ics *xics_ics;
 
 void xics_update_irq_servers(void)
 {
@@ -111,12 +111,11 @@ void xics_setup_cpu(void)
 
 void xics_mask_unknown_vec(unsigned int vec)
 {
-   struct ics *ics;
-
pr_err("Interrupt 0x%x (real) is invalid, disabling it.\n", vec);
 
-   list_for_each_entry(ics, &ics_list, link)
-   ics->mask_unknown(ics, vec);
+   if (WARN_ON(!xics_ics))
+   return;
+   xics_ics->mask_unknown(xics_ics, vec);
 }
 
 
@@ -198,7 +197,6 @@ void xics_migrate_irqs_away(void)
struct irq_chip *chip;
long server;
unsigned long flags;
-   struct ics *ics;
 
/* We can't set affinity on ISA interrupts */
if (virq < NUM_ISA_INTERRUPTS)
@@ -219,13 +217,10 @@ void xics_migrate_irqs_away(void)
raw_spin_lock_irqsave(&desc->lock, flags);
 
/* Locate interrupt server */
-   server = -1;
-   ics = irq_desc_get_chip_data(desc);
-   if (ics)
-   server = ics->get_server(ics, irq);
+   server = xics_ics->get_server(xics_ics, irq);
if (server < 0) {
-   printk(KERN_ERR "%s: Can't find server for irq %d\n",
-  __func__, irq);
+   pr_err("%s: Can't find server for irq %d/%x\n",
+  __func__, virq, irq);
goto unlock;
}
 
@@ -307,13 +302,9 @@ int xics_get_irq_server(unsigned int virq, const struct 
cpumask *cpumask,
 static int xics_host_match(struct irq_domain *h, struct device_node *node,
   enum irq_domain_bus_token bus_token)
 {
-   struct ics *ics;
-
-   list_for_each_entry(ics, &ics_list, link)
-   if (ics->host_match(ics, node))
-   return 1;
-
-   return 0;
+   if (WARN_ON(!xics_ics))
+   return 0;
+   return xics_ics->host_match(xics_ics, node) ? 1 : 0;
 }
 
 /* Dummies */
@@ -330,8 +321,6 @@ static struct irq_chip xics_ipi_chip = {
 static int xics_host_map(struct irq_domain *h, unsigned int virq,
 irq_hw_number_t hw)
 {
-   struct ics *ics;
-
pr_devel("xics: map virq %d, hwirq 0x%lx\n", virq, hw);
 
/*
@@ -348,12 +337,14 @@ static int xics_host_map(struct irq_domain *h, unsigned 
int virq,
return 0;
}
 
+   if (WARN_ON(!xics_ics))
+   return -EINVAL;
+
/* Let the ICS setup the chip data */
-   list_for_each_entry(ics, &ics_list, link)
-   if (ics->map(ics, virq) == 0)
-   return 0;
+   if (xics_ics->map(xics_ics, virq))
+   return -EINVAL;
 
-   return -EINVAL;
+   return 0;
 }
 
 static int xics_host_xlate(struct irq_domain *h, struct device_node *ct,
@@ -427,7 +418,9 @@ static void __init xics_init_host(void)
 
 void __init xics_register_ics(struct ics *ics)
 {
-   list_add(&ics->link, &ics_list);
+   if (WARN_ONCE(xics_ics, "XICS: Source Controller is already defined !"))
+   return;
+   xics_ics = ics;
 }
 
 static void __init xics_get_server_size(void)
-- 
2.31.1



[PATCH v2 02/32] powerpc/pseries/pci: Introduce rtas_prepare_msi_irqs()

2021-07-01 Thread Cédric Le Goater
This splits the routine setting the MSIs in two parts: allocation of
MSIs for the PCI device at the FW level (RTAS) and the actual mapping
and activation of the IRQs.

rtas_prepare_msi_irqs() will serve as a handler for the PCI MSI domain.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/pseries/msi.c | 23 +++
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index d2d090e04745..4bf14f27e1aa 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -373,12 +373,11 @@ static void rtas_hack_32bit_msi_gen2(struct pci_dev *pdev)
pci_write_config_dword(pdev, pdev->msi_cap + PCI_MSI_ADDRESS_HI, 0);
 }
 
-static int rtas_setup_msi_irqs(struct pci_dev *pdev, int nvec_in, int type)
+static int rtas_prepare_msi_irqs(struct pci_dev *pdev, int nvec_in, int type,
+msi_alloc_info_t *arg)
 {
struct pci_dn *pdn;
-   int hwirq, virq, i, quota, rc;
-   struct msi_desc *entry;
-   struct msi_msg msg;
+   int quota, rc;
int nvec = nvec_in;
int use_32bit_msi_hack = 0;
 
@@ -456,6 +455,22 @@ static int rtas_setup_msi_irqs(struct pci_dev *pdev, int 
nvec_in, int type)
return rc;
}
 
+   return 0;
+}
+
+static int rtas_setup_msi_irqs(struct pci_dev *pdev, int nvec_in, int type)
+{
+   struct pci_dn *pdn;
+   int hwirq, virq, i;
+   int rc;
+   struct msi_desc *entry;
+   struct msi_msg msg;
+
+   rc = rtas_prepare_msi_irqs(pdev, nvec_in, type, NULL);
+   if (rc)
+   return rc;
+
+   pdn = pci_get_pdn(pdev);
i = 0;
for_each_pci_msi_entry(entry, pdev) {
hwirq = rtas_query_irq_number(pdn, i++);
-- 
2.31.1



[PATCH v2 18/32] powerpc/xics: Give a name to the default XICS IRQ domain

2021-07-01 Thread Cédric Le Goater
and clean up the error path.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xics/xics-common.c | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/sysdev/xics/xics-common.c 
b/arch/powerpc/sysdev/xics/xics-common.c
index 399dd5becf65..419d91bffec3 100644
--- a/arch/powerpc/sysdev/xics/xics-common.c
+++ b/arch/powerpc/sysdev/xics/xics-common.c
@@ -412,11 +412,22 @@ static const struct irq_domain_ops xics_host_ops = {
.xlate = xics_host_xlate,
 };
 
-static void __init xics_init_host(void)
+static int __init xics_allocate_domain(void)
 {
-   xics_host = irq_domain_add_tree(NULL, &xics_host_ops, NULL);
-   BUG_ON(xics_host == NULL);
+   struct fwnode_handle *fn;
+
+   fn = irq_domain_alloc_named_fwnode("XICS");
+   if (!fn)
+   return -ENOMEM;
+
+   xics_host = irq_domain_create_tree(fn, &xics_host_ops, NULL);
+   if (!xics_host) {
+   irq_domain_free_fwnode(fn);
+   return -ENOMEM;
+   }
+
irq_set_default_host(xics_host);
+   return 0;
 }
 
 void __init xics_register_ics(struct ics *ics)
@@ -480,6 +491,8 @@ void __init xics_init(void)
/* Initialize common bits */
xics_get_server_size();
xics_update_irq_servers();
-   xics_init_host();
+   rc = xics_allocate_domain();
+   if (rc < 0)
+   pr_err("XICS: Failed to create IRQ domain");
xics_setup_cpu();
 }
-- 
2.31.1



[PATCH v2 28/32] powerpc/powernv/pci: Set the IRQ chip data for P8/CXL devices

2021-07-01 Thread Cédric Le Goater
Before MSI domains, the default IRQ chip of PHB3 MSIs was patched by
pnv_set_msi_irq_chip() with the custom EOI handler pnv_ioda2_msi_eoi()
and the owning PHB was deduced from the 'ioda.irq_chip' field. This
path has been deprecated by the MSI domains but it is still in use by
the P8 CAPI 'cxl' driver.

Rewriting this driver to support MSI would be a waste of time.
Nevertheless, we can still remove the IRQ chip patch and set the IRQ
chip data instead. This is cleaner.

Cc: Frederic Barrat 
Cc: Christophe Lombard 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 6c4b37598bcc..aa97245eedbf 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1971,19 +1971,23 @@ int64_t pnv_opal_pci_msi_eoi(struct irq_chip *chip, 
unsigned int hw_irq)
return opal_pci_msi_eoi(phb->opal_id, hw_irq);
 }
 
+/*
+ * The IRQ data is mapped in the XICS domain, with OPAL HW IRQ numbers
+ */
 static void pnv_ioda2_msi_eoi(struct irq_data *d)
 {
int64_t rc;
unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
-   struct irq_chip *chip = irq_data_get_irq_chip(d);
+   struct pci_controller *hose = irq_data_get_irq_chip_data(d);
+   struct pnv_phb *phb = hose->private_data;
 
-   rc = pnv_opal_pci_msi_eoi(chip, hw_irq);
+   rc = opal_pci_msi_eoi(phb->opal_id, hw_irq);
WARN_ON_ONCE(rc);
 
icp_native_eoi(d);
 }
 
-
+/* P8/CXL only */
 void pnv_set_msi_irq_chip(struct pnv_phb *phb, unsigned int virq)
 {
struct irq_data *idata;
@@ -2005,6 +2009,7 @@ void pnv_set_msi_irq_chip(struct pnv_phb *phb, unsigned 
int virq)
phb->ioda.irq_chip.irq_eoi = pnv_ioda2_msi_eoi;
}
irq_set_chip(virq, &phb->ioda.irq_chip);
+   irq_set_chip_data(virq, phb->hose);
 }
 
 static struct irq_chip pnv_pci_msi_irq_chip;
-- 
2.31.1



[PATCH v2 32/32] genirq: Improve "hwirq" output in /proc and /sys/

2021-07-01 Thread Cédric Le Goater
The HW IRQ numbers generated by the PCI MSI layer can be quite large
on a pSeries machine when running under the IBM Hypervisor and they
appear as negative. Use '%lu' instead to show them correctly.

Cc: Thomas Gleixner 
Signed-off-by: Cédric Le Goater 
---
 kernel/irq/irqdesc.c | 2 +-
 kernel/irq/proc.c| 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 4a617d7312a4..1d8b7fb6b366 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -188,7 +188,7 @@ static ssize_t hwirq_show(struct kobject *kobj,
 
raw_spin_lock_irq(&desc->lock);
if (desc->irq_data.domain)
-   ret = sprintf(buf, "%d\n", (int)desc->irq_data.hwirq);
+   ret = sprintf(buf, "%lu\n", desc->irq_data.hwirq);
raw_spin_unlock_irq(&desc->lock);
 
return ret;
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 7c5cd42df3b9..ee595ec09778 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -513,7 +513,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, " %8s", "None");
}
if (desc->irq_data.domain)
-   seq_printf(p, " %*d", prec, (int) desc->irq_data.hwirq);
+   seq_printf(p, " %*lu", prec, desc->irq_data.hwirq);
else
seq_printf(p, " %*s", prec, "");
 #ifdef CONFIG_GENERIC_IRQ_SHOW_LEVEL
-- 
2.31.1
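
For reference, hwirq numbers are unsigned long (include/linux/types.h),
hence the '%lu' format:

	typedef unsigned long irq_hw_number_t;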



[PATCH v2 15/32] KVM: PPC: Book3S HV: XIVE: Fix mapping of passthrough interrupts

2021-07-01 Thread Cédric Le Goater
PCI MSI interrupt numbers are now mapped in a PCI-MSI domain but the
underlying calls handling the passthrough of the interrupt in the
guest need a number in the XIVE IRQ domain.

Use the IRQ data mapped in the XIVE IRQ domain and not the one in the
PCI-MSI domain.

Cc: Thomas Gleixner 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/kvm/book3s_xive.c | 3 ++-
 kernel/irq/irqdomain.c | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 434da541a20b..d30eb35cc7f0 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -926,7 +926,8 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long 
guest_irq,
struct kvmppc_xive *xive = kvm->arch.xive;
struct kvmppc_xive_src_block *sb;
struct kvmppc_xive_irq_state *state;
-   struct irq_data *host_data = irq_get_irq_data(host_irq);
+   struct irq_data *host_data =
+   irq_domain_get_irq_data(irq_get_default_host(), host_irq);
unsigned int hw_irq = (unsigned int)irqd_to_hwirq(host_data);
u16 idx;
u8 prio;
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index 6284443b87ec..c8c06318dcbf 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -481,6 +481,7 @@ struct irq_domain *irq_get_default_host(void)
 {
return irq_default_domain;
 }
+EXPORT_SYMBOL_GPL(irq_get_default_host);
 
 static void irq_domain_clear_mapping(struct irq_domain *domain,
 irq_hw_number_t hwirq)
-- 
2.31.1



[PATCH v2] sched: Use BUG_ON

2021-07-01 Thread Jason Wang
The BUG_ON macro simplifies an if condition followed by BUG(), so that
we can use BUG_ON instead of an if condition followed by BUG().

Signed-off-by: Jason Wang 
---
 arch/powerpc/platforms/cell/spufs/sched.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c 
b/arch/powerpc/platforms/cell/spufs/sched.c
index 369206489895..0f218d9e5733 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -904,8 +904,8 @@ static noinline void spusched_tick(struct spu_context *ctx)
struct spu_context *new = NULL;
struct spu *spu = NULL;
 
-   if (spu_acquire(ctx))
-   BUG();  /* a kernel thread never has signals pending */
+   /* a kernel thread never has signals pending */
+   BUG_ON(spu_acquire(ctx));
 
if (ctx->state != SPU_STATE_RUNNABLE)
goto out;
-- 
2.32.0





[PATCH v2 24/32] powerpc/pseries/pci: Drop unused MSI code

2021-07-01 Thread Cédric Le Goater
MSIs should be fully managed by the PCI and IRQ subsystems now.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/pseries/msi.c | 87 
 1 file changed, 87 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index e196cc1b8540..1b305e411862 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -111,21 +111,6 @@ static int rtas_query_irq_number(struct pci_dn *pdn, int 
offset)
return rtas_ret[0];
 }
 
-static void rtas_teardown_msi_irqs(struct pci_dev *pdev)
-{
-   struct msi_desc *entry;
-
-   for_each_pci_msi_entry(entry, pdev) {
-   if (!entry->irq)
-   continue;
-
-   irq_set_msi_desc(entry->irq, NULL);
-   irq_dispose_mapping(entry->irq);
-   }
-
-   rtas_disable_msi(pdev);
-}
-
 static int check_req(struct pci_dev *pdev, int nvec, char *prop_name)
 {
struct device_node *dn;
@@ -459,66 +444,6 @@ static int rtas_prepare_msi_irqs(struct pci_dev *pdev, int 
nvec_in, int type,
return 0;
 }
 
-static int rtas_setup_msi_irqs(struct pci_dev *pdev, int nvec_in, int type)
-{
-   struct pci_dn *pdn;
-   int hwirq, virq, i;
-   int rc;
-   struct msi_desc *entry;
-   struct msi_msg msg;
-
-   rc = rtas_prepare_msi_irqs(pdev, nvec_in, type, NULL);
-   if (rc)
-   return rc;
-
-   pdn = pci_get_pdn(pdev);
-   i = 0;
-   for_each_pci_msi_entry(entry, pdev) {
-   hwirq = rtas_query_irq_number(pdn, i++);
-   if (hwirq < 0) {
-   pr_debug("rtas_msi: error (%d) getting hwirq\n", rc);
-   return hwirq;
-   }
-
-   /*
-* Depending on the number of online CPUs in the original
-* kernel, it is likely for CPU #0 to be offline in a kdump
-* kernel. The associated IRQs in the affinity mappings
-* provided by irq_create_affinity_masks() are thus not
-* started by irq_startup(), as per-design for managed IRQs.
-* This can be a problem with multi-queue block devices driven
-* by blk-mq : such a non-started IRQ is very likely paired
-* with the single queue enforced by blk-mq during kdump (see
-* blk_mq_alloc_tag_set()). This causes the device to remain
-* silent and likely hangs the guest at some point.
-*
-* We don't really care for fine-grained affinity when doing
-* kdump actually : simply ignore the pre-computed affinity
-* masks in this case and let the default mask with all CPUs
-* be used when creating the IRQ mappings.
-*/
-   if (is_kdump_kernel())
-   virq = irq_create_mapping(NULL, hwirq);
-   else
-   virq = irq_create_mapping_affinity(NULL, hwirq,
-  entry->affinity);
-
-   if (!virq) {
-   pr_debug("rtas_msi: Failed mapping hwirq %d\n", hwirq);
-   return -ENOSPC;
-   }
-
-   dev_dbg(&pdev->dev, "rtas_msi: allocated virq %d\n", virq);
-   irq_set_msi_desc(virq, entry);
-
-   /* Read config space back so we can restore after reset */
-   __pci_read_msi_msg(entry, &msg);
-   entry->msg = msg;
-   }
-
-   return 0;
-}
-
 static int pseries_msi_ops_prepare(struct irq_domain *domain, struct device 
*dev,
   int nvec, msi_alloc_info_t *arg)
 {
@@ -759,8 +684,6 @@ static void rtas_msi_pci_irq_fixup(struct pci_dev *pdev)
 
 static int rtas_msi_init(void)
 {
-   struct pci_controller *phb;
-
query_token  = rtas_token("ibm,query-interrupt-source-number");
change_token = rtas_token("ibm,change-msi");
 
@@ -772,16 +695,6 @@ static int rtas_msi_init(void)
 
pr_debug("rtas_msi: Registering RTAS MSI callbacks.\n");
 
-   WARN_ON(pseries_pci_controller_ops.setup_msi_irqs);
-   pseries_pci_controller_ops.setup_msi_irqs = rtas_setup_msi_irqs;
-   pseries_pci_controller_ops.teardown_msi_irqs = rtas_teardown_msi_irqs;
-
-   list_for_each_entry(phb, &hose_list, list_node) {
-   WARN_ON(phb->controller_ops.setup_msi_irqs);
-   phb->controller_ops.setup_msi_irqs = rtas_setup_msi_irqs;
-   phb->controller_ops.teardown_msi_irqs = rtas_teardown_msi_irqs;
-   }
-
WARN_ON(ppc_md.pci_irq_fixup);
ppc_md.pci_irq_fixup = rtas_msi_pci_irq_fixup;
 
-- 
2.31.1



[PATCH v2 03/32] powerpc/xive: Add support for IRQ domain hierarchy

2021-07-01 Thread Cédric Le Goater
This adds handlers to allocate/free IRQs in a domain hierarchy. We
could try to use xive_irq_domain_map() in xive_irq_domain_alloc(), but
since we rely on xive_irq_alloc_data() to set the IRQ handler data,
duplicating the code is simpler.

xive_irq_free_data() needs to be called when IRQs are freed to clear
the MMIO mappings and free the XIVE handler data (the xive_irq_data
structure). This is going to be a problem with MSI domains, which we
will address later.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 64 +++
 1 file changed, 64 insertions(+)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index f985ed331a8c..834f1a378fc2 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1375,7 +1375,71 @@ static void xive_irq_domain_debug_show(struct seq_file 
*m, struct irq_domain *d,
 }
 #endif
 
+#ifdef CONFIG_IRQ_DOMAIN_HIERARCHY
+static int xive_irq_domain_translate(struct irq_domain *d,
+struct irq_fwspec *fwspec,
+unsigned long *hwirq,
+unsigned int *type)
+{
+   return xive_irq_domain_xlate(d, to_of_node(fwspec->fwnode),
+fwspec->param, fwspec->param_count,
+hwirq, type);
+}
+
+static int xive_irq_domain_alloc(struct irq_domain *domain, unsigned int virq,
+unsigned int nr_irqs, void *arg)
+{
+   struct irq_fwspec *fwspec = arg;
+   irq_hw_number_t hwirq;
+   unsigned int type = IRQ_TYPE_NONE;
+   int i, rc;
+
+   rc = xive_irq_domain_translate(domain, fwspec, &hwirq, &type);
+   if (rc)
+   return rc;
+
+   pr_debug("%s %d/%lx #%d\n", __func__, virq, hwirq, nr_irqs);
+
+   for (i = 0; i < nr_irqs; i++) {
+   /* TODO: call xive_irq_domain_map() */
+
+   /*
+* Mark interrupts as edge sensitive by default so that resend
+* actually works. Will fix that up below if needed.
+*/
+   irq_clear_status_flags(virq, IRQ_LEVEL);
+
+   /* allocates and sets handler data */
+   rc = xive_irq_alloc_data(virq + i, hwirq + i);
+   if (rc)
+   return rc;
+
+   irq_domain_set_hwirq_and_chip(domain, virq + i, hwirq + i,
+ &xive_irq_chip, domain->host_data);
+   irq_set_handler(virq + i, handle_fasteoi_irq);
+   }
+
+   return 0;
+}
+
+static void xive_irq_domain_free(struct irq_domain *domain,
+unsigned int virq, unsigned int nr_irqs)
+{
+   int i;
+
+   pr_debug("%s %d #%d\n", __func__, virq, nr_irqs);
+
+   for (i = 0; i < nr_irqs; i++)
+   xive_irq_free_data(virq + i);
+}
+#endif
+
 static const struct irq_domain_ops xive_irq_domain_ops = {
+#ifdef CONFIG_IRQ_DOMAIN_HIERARCHY
+   .alloc  = xive_irq_domain_alloc,
+   .free   = xive_irq_domain_free,
+   .translate = xive_irq_domain_translate,
+#endif
.match = xive_irq_domain_match,
.map = xive_irq_domain_map,
.unmap = xive_irq_domain_unmap,
-- 
2.31.1



[PATCH v2 21/32] powerpc/powernv/pci: Customize the MSI EOI handler to support PHB3

2021-07-01 Thread Cédric Le Goater
PHB3s need an extra OPAL call to EOI the interrupt. The call takes an
OPAL HW IRQ number but it is translated into a vector number in OPAL.
Here, we directly use the vector number of the in-the-middle "PNV-MSI"
domain instead of grabbing the OPAL HW IRQ number in the XICS parent
domain.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index e77caa4dbbdf..b498876a976f 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2169,12 +2169,33 @@ static void pnv_msi_compose_msg(struct irq_data *d, 
struct msi_msg *msg)
entry->msi_attrib.is_64 ? "64" : "32", d->hwirq, rc);
 }
 
+/*
+ * The IRQ data is mapped in the MSI domain in which HW IRQ numbers
+ * correspond to vector numbers.
+ */
+static void pnv_msi_eoi(struct irq_data *d)
+{
+   struct pci_controller *hose = irq_data_get_irq_chip_data(d);
+   struct pnv_phb *phb = hose->private_data;
+
+   if (phb->model == PNV_PHB_MODEL_PHB3) {
+   /*
+* The EOI OPAL call takes an OPAL HW IRQ number but
+* since it is translated into a vector number in
+* OPAL, use that directly.
+*/
+   WARN_ON_ONCE(opal_pci_msi_eoi(phb->opal_id, d->hwirq));
+   }
+
+   irq_chip_eoi_parent(d);
+}
+
 static struct irq_chip pnv_msi_irq_chip = {
.name   = "PNV-MSI",
.irq_shutdown   = pnv_msi_shutdown,
.irq_mask   = irq_chip_mask_parent,
.irq_unmask = irq_chip_unmask_parent,
-   .irq_eoi= irq_chip_eoi_parent,
+   .irq_eoi= pnv_msi_eoi,
.irq_set_affinity   = irq_chip_set_affinity_parent,
.irq_compose_msi_msg= pnv_msi_compose_msg,
 };
-- 
2.31.1



[PATCH v2 10/32] powerpc/pseries/pci: Add support of MSI domains to PHB hotplug

2021-07-01 Thread Cédric Le Goater
Simply allocate or release the MSI domains when a PHB is inserted in
or removed from the machine.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/platforms/pseries/pseries.h   |  1 +
 arch/powerpc/platforms/pseries/msi.c   | 10 ++
 arch/powerpc/platforms/pseries/pci_dlpar.c |  4 
 3 files changed, 15 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/pseries.h 
b/arch/powerpc/platforms/pseries/pseries.h
index d9280262588b..3544778e06d0 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -86,6 +86,7 @@ int pseries_root_bridge_prepare(struct pci_host_bridge 
*bridge);
 
 extern struct pci_controller_ops pseries_pci_controller_ops;
 int pseries_msi_allocate_domains(struct pci_controller *phb);
+void pseries_msi_free_domains(struct pci_controller *phb);
 
 unsigned long pseries_memory_block_size(void);
 
diff --git a/arch/powerpc/platforms/pseries/msi.c 
b/arch/powerpc/platforms/pseries/msi.c
index f9635b01b2bf..e2127a3f7ebd 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -733,6 +733,16 @@ int pseries_msi_allocate_domains(struct pci_controller 
*phb)
return __pseries_msi_allocate_domains(phb, count);
 }
 
+void pseries_msi_free_domains(struct pci_controller *phb)
+{
+   if (phb->msi_domain)
+   irq_domain_remove(phb->msi_domain);
+   if (phb->dev_domain)
+   irq_domain_remove(phb->dev_domain);
+   if (phb->fwnode)
+   irq_domain_free_fwnode(phb->fwnode);
+}
+
 static void rtas_msi_pci_irq_fixup(struct pci_dev *pdev)
 {
/* No LSI -> leave MSIs (if any) configured */
diff --git a/arch/powerpc/platforms/pseries/pci_dlpar.c 
b/arch/powerpc/platforms/pseries/pci_dlpar.c
index a8f9140a24fa..90c9d3531694 100644
--- a/arch/powerpc/platforms/pseries/pci_dlpar.c
+++ b/arch/powerpc/platforms/pseries/pci_dlpar.c
@@ -33,6 +33,8 @@ struct pci_controller *init_phb_dynamic(struct device_node 
*dn)
 
pci_devs_phb_init_dynamic(phb);
 
+   pseries_msi_allocate_domains(phb);
+
/* Create EEH devices for the PHB */
eeh_phb_pe_create(phb);
 
@@ -74,6 +76,8 @@ int remove_phb_dynamic(struct pci_controller *phb)
}
}
 
+   pseries_msi_free_domains(phb);
+
/* Remove the PCI bus and unregister the bridge device from sysfs */
phb->bus = NULL;
pci_remove_bus(b);
-- 
2.31.1



[PATCH v2 06/32] powerpc/xive: Drop unmask of MSIs at startup

2021-07-01 Thread Cédric Le Goater
That was a workaround in the XIVE domain because of the lack of an MSI
domain. This is now handled by the MSI domain.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 10 --
 1 file changed, 10 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 2c907a4a2b05..a03057bfccfd 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -626,16 +626,6 @@ static unsigned int xive_irq_startup(struct irq_data *d)
pr_devel("xive_irq_startup: irq %d [0x%x] data @%p\n",
 d->irq, hw_irq, d);
 
-#ifdef CONFIG_PCI_MSI
-   /*
-* The generic MSI code returns with the interrupt disabled on the
-* card, using the MSI mask bits. Firmware doesn't appear to unmask
-* at that level, so we do it here by hand.
-*/
-   if (irq_data_get_msi_desc(d))
-   pci_msi_unmask_irq(d);
-#endif
-
/* Pick a target */
target = xive_pick_irq_target(d, irq_data_get_affinity_mask(d));
if (target == XIVE_INVALID_TARGET) {
-- 
2.31.1



[PATCH 0/2] powerpc/bpf: Fix issue with atomic ops

2021-07-01 Thread Naveen N. Rao
The first patch fixes an issue that causes a soft lockup on ppc64 with 
the BPF_ATOMIC bounds propagation verifier test. The second one updates 
ppc32 JIT to reject atomic operations properly.

- Naveen

Naveen N. Rao (2):
  powerpc/bpf: Fix detecting BPF atomic instructions
  powerpc/bpf: Reject atomic ops in ppc32 JIT

 arch/powerpc/net/bpf_jit_comp32.c | 14 +++---
 arch/powerpc/net/bpf_jit_comp64.c |  4 ++--
 2 files changed, 13 insertions(+), 5 deletions(-)


base-commit: 086d9878e1092e7e69a69676ee9ec792690abb1d
-- 
2.31.1



[PATCH 1/2] powerpc/bpf: Fix detecting BPF atomic instructions

2021-07-01 Thread Naveen N. Rao
Commit 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other
atomics in .imm") converted BPF_XADD to BPF_ATOMIC and added a way to
distinguish instructions based on the immediate field. Existing JIT
implementations were updated to check for the immediate field and to
reject programs utilizing anything more than BPF_ADD (such as BPF_FETCH)
in the immediate field.

However, the check added to powerpc64 JIT did not look at the correct
BPF instruction. Due to this, such programs would be accepted and
incorrectly JIT'ed resulting in soft lockups, as seen with the atomic
bounds test. Fix this by looking at the correct immediate value.

Fixes: 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other 
atomics in .imm")
Reported-by: Jiri Olsa 
Tested-by: Jiri Olsa 
Signed-off-by: Naveen N. Rao 
---
Hi Jiri,
FYI: I made a small change in this patch -- using 'imm' directly, rather 
than insn[i].imm. I've still added your Tested-by since this shouldn't 
impact the fix in any way.

- Naveen
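
For readers following along: with BPF_ATOMIC the operation is encoded
in the instruction's immediate field rather than in the opcode, so the
JIT must inspect insn->imm (cached in 'imm' by the JIT loop). A minimal
sketch of the distinction using the BPF_ATOMIC_OP() helper from
include/linux/filter.h -- both instructions below share the same opcode
and differ only in .imm:

	#include <linux/filter.h>

	/* code = BPF_STX | BPF_W | BPF_ATOMIC for both; .imm selects the op */
	struct bpf_insn xadd  = BPF_ATOMIC_OP(BPF_W, BPF_ADD,
					      BPF_REG_1, BPF_REG_2, 0);
	struct bpf_insn fetch = BPF_ATOMIC_OP(BPF_W, BPF_ADD | BPF_FETCH,
					      BPF_REG_1, BPF_REG_2, 0);

The pre-fix check read the immediate from the wrong instruction, so the
'fetch' form above could slip through and get JIT'ed incorrectly.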


 arch/powerpc/net/bpf_jit_comp64.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
b/arch/powerpc/net/bpf_jit_comp64.c
index 5cad5b5a7e9774..de8595880feec6 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -667,7 +667,7 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
 * BPF_STX ATOMIC (atomic ops)
 */
case BPF_STX | BPF_ATOMIC | BPF_W:
-   if (insn->imm != BPF_ADD) {
+   if (imm != BPF_ADD) {
pr_err_ratelimited(
"eBPF filter atomic op code %02x (@%d) 
unsupported\n",
code, i);
@@ -689,7 +689,7 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
PPC_BCC_SHORT(COND_NE, tmp_idx);
break;
case BPF_STX | BPF_ATOMIC | BPF_DW:
-   if (insn->imm != BPF_ADD) {
+   if (imm != BPF_ADD) {
pr_err_ratelimited(
"eBPF filter atomic op code %02x (@%d) 
unsupported\n",
code, i);
-- 
2.31.1



[PATCH 2/2] powerpc/bpf: Reject atomic ops in ppc32 JIT

2021-07-01 Thread Naveen N. Rao
Commit 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other
atomics in .imm") converted BPF_XADD to BPF_ATOMIC and updated all JIT
implementations to reject JIT'ing instructions with an immediate value
different from BPF_ADD. However, ppc32 BPF JIT was implemented around
the same time and didn't include the same change. Update the ppc32 JIT
accordingly.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/net/bpf_jit_comp32.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit_comp32.c 
b/arch/powerpc/net/bpf_jit_comp32.c
index cbe5b399ed869d..91c990335a16c9 100644
--- a/arch/powerpc/net/bpf_jit_comp32.c
+++ b/arch/powerpc/net/bpf_jit_comp32.c
@@ -773,9 +773,17 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
break;
 
/*
-* BPF_STX XADD (atomic_add)
+* BPF_STX ATOMIC (atomic ops)
 */
-   case BPF_STX | BPF_XADD | BPF_W: /* *(u32 *)(dst + off) += src 
*/
+   case BPF_STX | BPF_ATOMIC | BPF_W:
+   if (imm != BPF_ADD) {
+   pr_err_ratelimited(
+   "eBPF filter atomic op code %02x (@%d) 
unsupported\n", code, i);
+   return -ENOTSUPP;
+   }
+
+   /* *(u32 *)(dst + off) += src */
+
bpf_set_seen_register(ctx, tmp_reg);
/* Get offset into TMP_REG */
EMIT(PPC_RAW_LI(tmp_reg, off));
@@ -789,7 +797,7 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
PPC_BCC_SHORT(COND_NE, (ctx->idx - 3) * 4);
break;
 
-   case BPF_STX | BPF_XADD | BPF_DW: /* *(u64 *)(dst + off) += src 
*/
+   case BPF_STX | BPF_ATOMIC | BPF_DW: /* *(u64 *)(dst + off) += 
src */
return -EOPNOTSUPP;
 
/*
-- 
2.31.1



[PATCH] powerpc/xive: Fix error handling when allocating an IPI

2021-07-01 Thread Cédric Le Goater
This fixes the following smatch warning:

  arch/powerpc/sysdev/xive/common.c:1161 xive_request_ipi() warn: unsigned 
'xid->irq' is never less than zero.
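
The class of bug smatch is pointing at: xid->irq is an unsigned int, so
storing the (possibly negative) return value of irq_domain_alloc_irqs()
into it and only then comparing against zero can never catch the error.
A standalone illustration of the pattern, with hypothetical names:

	#include <stdio.h>

	static int alloc_irq(void)
	{
		return -12;	/* pretend the allocation failed (-ENOMEM) */
	}

	int main(void)
	{
		unsigned int irq = alloc_irq();	/* negative value wraps around */

		if (irq < 0)			/* always false: irq is unsigned */
			printf("error path, never reached\n");

		printf("irq = %u\n", irq);	/* prints 4294967284 */
		return 0;
	}

Hence the fix below: keep the return value in a signed 'ret', check it,
and only then assign it to xid->irq.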

Fixes: fd6db2892eba ("powerpc/xive: Modernize XIVE-IPI domain with an 'alloc' 
handler")
Cc: sta...@vger.kernel.org # v5.13
Reported-by: kernel test robot 
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index f8ff558bc305..7bbb9bc83057 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1148,11 +1148,10 @@ static int __init xive_request_ipi(void)
 * Since the HW interrupt number doesn't have any meaning,
 * simply use the node number.
 */
-   xid->irq = irq_domain_alloc_irqs(ipi_domain, 1, node, &info);
-   if (xid->irq < 0) {
-   ret = xid->irq;
+   ret = irq_domain_alloc_irqs(ipi_domain, 1, node, &info);
+   if (ret < 0)
goto out_free_xive_ipis;
-   }
+   xid->irq = ret;
 
snprintf(xid->name, sizeof(xid->name), "IPI-%d", node);
 
-- 
2.31.1



Re: [PATCH 1/2] powerpc/bpf: Fix detecting BPF atomic instructions

2021-07-01 Thread Alexei Starovoitov
On Thu, Jul 1, 2021 at 8:09 AM Naveen N. Rao
 wrote:
>
> Commit 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other
> atomics in .imm") converted BPF_XADD to BPF_ATOMIC and added a way to
> distinguish instructions based on the immediate field. Existing JIT
> implementations were updated to check for the immediate field and to
> reject programs utilizing anything more than BPF_ADD (such as BPF_FETCH)
> in the immediate field.
>
> However, the check added to powerpc64 JIT did not look at the correct
> BPF instruction. Due to this, such programs would be accepted and
> incorrectly JIT'ed resulting in soft lockups, as seen with the atomic
> bounds test. Fix this by looking at the correct immediate value.
>
> Fixes: 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other 
> atomics in .imm")
> Reported-by: Jiri Olsa 
> Tested-by: Jiri Olsa 
> Signed-off-by: Naveen N. Rao 
> ---
> Hi Jiri,
> FYI: I made a small change in this patch -- using 'imm' directly, rather
> than insn[i].imm. I've still added your Tested-by since this shouldn't
> impact the fix in any way.
>
> - Naveen

Excellent debugging! You guys are awesome.
How do you want this fix routed? via bpf tree?


Re: [PATCH 2/2] powerpc/bpf: Reject atomic ops in ppc32 JIT

2021-07-01 Thread Christophe Leroy




On 01/07/2021 at 17:08, Naveen N. Rao wrote:

Commit 91c960b0056672 ("bpf: Rename BPF_XADD and prepare to encode other
atomics in .imm") converted BPF_XADD to BPF_ATOMIC and updated all JIT
implementations to reject JIT'ing instructions with an immediate value
different from BPF_ADD. However, ppc32 BPF JIT was implemented around
the same time and didn't include the same change. Update the ppc32 JIT
accordingly.

Signed-off-by: Naveen N. Rao 


Shouldn't it also include a Fixes tag and a stable Cc, as PPC32 eBPF was
added in 5.13?

Fixes: 51c66ad849a7 ("powerpc/bpf: Implement extended BPF on PPC32")
Cc: sta...@vger.kernel.org


---
  arch/powerpc/net/bpf_jit_comp32.c | 14 +++---
  1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit_comp32.c 
b/arch/powerpc/net/bpf_jit_comp32.c
index cbe5b399ed869d..91c990335a16c9 100644
--- a/arch/powerpc/net/bpf_jit_comp32.c
+++ b/arch/powerpc/net/bpf_jit_comp32.c
@@ -773,9 +773,17 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
break;
  
  		/*

-* BPF_STX XADD (atomic_add)
+* BPF_STX ATOMIC (atomic ops)
 */
-   case BPF_STX | BPF_XADD | BPF_W: /* *(u32 *)(dst + off) += src 
*/
+   case BPF_STX | BPF_ATOMIC | BPF_W:
+   if (imm != BPF_ADD) {
+   pr_err_ratelimited(
+   "eBPF filter atomic op code %02x (@%d) 
unsupported\n", code, i);
+   return -ENOTSUPP;
+   }
+
+   /* *(u32 *)(dst + off) += src */
+
bpf_set_seen_register(ctx, tmp_reg);
/* Get offset into TMP_REG */
EMIT(PPC_RAW_LI(tmp_reg, off));
@@ -789,7 +797,7 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
PPC_BCC_SHORT(COND_NE, (ctx->idx - 3) * 4);
break;
  
-		case BPF_STX | BPF_XADD | BPF_DW: /* *(u64 *)(dst + off) += src */

+   case BPF_STX | BPF_ATOMIC | BPF_DW: /* *(u64 *)(dst + off) += 
src */
return -EOPNOTSUPP;
  
  		/*




Re: [PATCH v2] sched: Use BUG_ON

2021-07-01 Thread Jeremy Kerr
Hi Jason,

> The BUG_ON() macro combines an if condition followed by BUG() into a
> single statement, so use BUG_ON() instead of the open-coded form.

[...]

> -   if (spu_acquire(ctx))
> -   BUG();  /* a kernel thread never has signals pending */
> +   /* a kernel thread never has signals pending */
> +   BUG_ON(spu_acquire(ctx));

I'm not convinced that this is an improvement; you've combined the
acquire and the BUG into a single statement, and now it's no longer
clear what the comment applies to.

If you really wanted to use BUG_ON, something like this would be more
clear:

rc = spu_acquire(ctx);
/* a kernel thread never has signals pending */
BUG_ON(rc);

but we don't have a suitable rc variable handy, so we'd need one of
those declared too. You could avoid that with:

if (spu_acquire(ctx))
BUG_ON(1); /* a kernel thread never has signals pending */

but wait: no need for the constant there, so this would be better:

if (spu_acquire(ctx))
BUG(); /* a kernel thread never has signals pending */

wait, what are we doing again?

To me, this is a bit of shuffling code around, for no real benefit.

Regards,


Jeremy



Re: [PATCH] powerpc/mm: Fix lockup on kernel exec fault

2021-07-01 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of July 1, 2021 9:17 pm:
> The powerpc kernel is not prepared to handle exec faults from kernel.
> Especially, the function is_exec_fault() will return 'false' when an
> exec fault is taken by kernel, because the check is based on reading
> current->thread.regs->trap which contains the trap from user.
> 
> For instance, when provoking a LKDTM EXEC_USERSPACE test,
> current->thread.regs->trap is set to SYSCALL trap (0xc00), and
> the fault taken by the kernel is not seen as an exec fault by
> set_access_flags_filter().
> 
> Commit d7df2443cd5f ("powerpc/mm: Fix spurrious segfaults on radix
> with autonuma") made it clear and handled it properly. But later on
> commit d3ca587404b3 ("powerpc/mm: Fix reporting of kernel execute
> faults") removed that handling, introducing test based on error_code.
> And here is the problem, because on the 603 all upper bits of SRR1
> get cleared when the TLB instruction miss handler bails out to ISI.

So the problem is 603 doesn't see the DSISR_NOEXEC_OR_G bit?

I don't see the problem with this for 64s, I don't think anything sane
can be done for any 0x400 interrupt in the kernel so it's probably
good to catch all here just in case. For 64s,

Acked-by: Nicholas Piggin 

Why is 32s clearing those top bits? And it seems to be setting DSISR
that AFAIKS it does not use. Seems like it would be good to add a
NOEXEC_OR_G bit into SRR1.

Thanks,
Nick


> Until commit cbd7e6ca0210 ("powerpc/fault: Avoid heavy
> search_exception_tables() verification"), an exec fault from kernel
> at a userspace address was indirectly caught by the lack of entry for
> that address in the exception tables. But after that commit the
> kernel mainly relies on KUAP or on core mm handling to catch wrong
> user accesses. Here the access is not wrong, so mm handles it.
> It is a minor fault because PAGE_EXEC is not set,
> set_access_flags_filter() should set PAGE_EXEC and voila.
> But as is_exec_fault() returns false as explained in the beginning,
> set_access_flags_filter() bails out without setting the PAGE_EXEC flag,
> which leads to a never-ending minor exec fault.
> 
> As the kernel is not prepared to handle such exec faults, the thing
> to do is to fire in bad_kernel_fault() for any exec fault taken by
> the kernel, as it was prior to commit d3ca587404b3.
> 
> Fixes: d3ca587404b3 ("powerpc/mm: Fix reporting of kernel execute faults")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/mm/fault.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 34f641d4a2fe..a8d0ce85d39a 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -199,9 +199,7 @@ static bool bad_kernel_fault(struct pt_regs *regs, 
> unsigned long error_code,
>  {
>   int is_exec = TRAP(regs) == INTERRUPT_INST_STORAGE;
>  
> - /* NX faults set DSISR_PROTFAULT on the 8xx, DSISR_NOEXEC_OR_G on 
> others */
> - if (is_exec && (error_code & (DSISR_NOEXEC_OR_G | DSISR_KEYFAULT |
> -   DSISR_PROTFAULT))) {
> + if (is_exec) {
>   pr_crit_ratelimited("kernel tried to execute %s page (%lx) - 
> exploit attempt? (uid: %d)\n",
>   address >= TASK_SIZE ? "exec-protected" : 
> "user",
>   address,
> -- 
> 2.25.0
> 
> 


Re: [PATCH] Documentation: PCI: pci-error-recovery: rearrange the general sequence

2021-07-01 Thread Bjorn Helgaas
Please make the subject a little more specific.  "rearrange the
general sequence" doesn't say anything about what was affected.

On Fri, Jun 18, 2021 at 02:04:46PM +0800, Wesley Sheng wrote:
> Reset_link() callback function was called before mmio_enabled() in
> pcie_do_recovery() function actually, so rearrange the general
> sequence betwen step 2 and step 3 accordingly.

s/betwen/between/

Not sure "general" adds anything in this sentence.  "Step 2 and step
3" are not meaningful here in the commit log.  It needs to spell out
what those steps are so the log makes sense by itself.

"reset_link" does not appear in pcie_do_recovery().  I'm guessing
you're referring to the "reset_subordinates" function pointer?

> Signed-off-by: Wesley Sheng 

I didn't quite understand your response to Oliver, so I'll wait for
your corrections and his ack before proceeding.

> ---
>  Documentation/PCI/pci-error-recovery.rst | 23 ---
>  1 file changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/Documentation/PCI/pci-error-recovery.rst 
> b/Documentation/PCI/pci-error-recovery.rst
> index 187f43a03200..ac6a8729ef28 100644
> --- a/Documentation/PCI/pci-error-recovery.rst
> +++ b/Documentation/PCI/pci-error-recovery.rst
> @@ -184,7 +184,14 @@ is STEP 6 (Permanent Failure).
> and prints an error to syslog.  A reboot is then required to
> get the device working again.
>  
> -STEP 2: MMIO Enabled
> +STEP 2: Link Reset
> +--
> +The platform resets the link.  This is a PCI-Express specific step
> +and is done whenever a fatal error has been detected that can be
> +"solved" by resetting the link.
> +
> +
> +STEP 3: MMIO Enabled
>  
>  The platform re-enables MMIO to the device (but typically not the
>  DMA), and then calls the mmio_enabled() callback on all affected
> @@ -197,8 +204,8 @@ information, if any, and eventually do things like 
> trigger a device local
>  reset or some such, but not restart operations. This callback is made if
>  all drivers on a segment agree that they can try to recover and if no 
> automatic
>  link reset was performed by the HW. If the platform can't just re-enable IOs
> -without a slot reset or a link reset, it will not call this callback, and
> -instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
> +without a slot reset, it will not call this callback, and
> +instead will have gone directly or STEP 4 (Slot Reset)

s/or/to/  ?

>  .. note::
>  
> @@ -210,7 +217,7 @@ instead will have gone directly to STEP 3 (Link Reset) or 
> STEP 4 (Slot Reset)
> such an error might cause IOs to be re-blocked for the whole
> segment, and thus invalidate the recovery that other devices
> on the same segment might have done, forcing the whole segment
> -   into one of the next states, that is, link reset or slot reset.
> +   into next states, that is, slot reset.

s/into next states/into the next state/ ?

>  The driver should return one of the following result codes:
>- PCI_ERS_RESULT_RECOVERED
> @@ -233,17 +240,11 @@ The driver should return one of the following result 
> codes:
>  
>  The next step taken depends on the results returned by the drivers.
>  If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
> -proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
> +proceeds to STEP 5 (Resume Operations).
>  
>  If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
>  proceeds to STEP 4 (Slot Reset)
>  
> -STEP 3: Link Reset
> ---
> -The platform resets the link.  This is a PCI-Express specific step
> -and is done whenever a fatal error has been detected that can be
> -"solved" by resetting the link.
> -
>  STEP 4: Slot Reset
>  --
>  
> -- 
> 2.25.1
> 


[PATCH v3 1/5] nvmem: nintendo-otp: Add new driver for the Wii and Wii U OTP

2021-07-01 Thread Emmanuel Gil Peyrot
This OTP is read-only and contains various keys used by the console to
decrypt, encrypt or verify various pieces of storage.

Its size depends on the console: it is 128 bytes on the Wii and
1024 bytes on the Wii U (split into eight 128-byte banks).

It can be used directly by writing into one register and reading from
the other one, without any additional synchronisation.

This driver was written based on reversed documentation, see:
https://wiiubrew.org/wiki/Hardware/OTP

Signed-off-by: Emmanuel Gil Peyrot 
Tested-by: Jonathan Neuschäfer   # on Wii
Tested-by: Emmanuel Gil Peyrot   # on Wii U
---
 drivers/nvmem/Kconfig|  11 
 drivers/nvmem/Makefile   |   2 +
 drivers/nvmem/nintendo-otp.c | 124 +++
 3 files changed, 137 insertions(+)
 create mode 100644 drivers/nvmem/nintendo-otp.c

diff --git a/drivers/nvmem/Kconfig b/drivers/nvmem/Kconfig
index dd2019006838..39854d43758b 100644
--- a/drivers/nvmem/Kconfig
+++ b/drivers/nvmem/Kconfig
@@ -107,6 +107,17 @@ config MTK_EFUSE
  This driver can also be built as a module. If so, the module
  will be called efuse-mtk.
 
+config NVMEM_NINTENDO_OTP
+   tristate "Nintendo Wii and Wii U OTP Support"
+   help
+ This is a driver exposing the OTP of a Nintendo Wii or Wii U console.
+
+ This memory contains common and per-console keys, signatures and
+ related data required to access peripherals.
+
+ This driver can also be built as a module. If so, the module
+ will be called nvmem-nintendo-otp.
+
 config QCOM_QFPROM
tristate "QCOM QFPROM Support"
depends on ARCH_QCOM || COMPILE_TEST
diff --git a/drivers/nvmem/Makefile b/drivers/nvmem/Makefile
index bbea1410240a..dcbbde35b6a8 100644
--- a/drivers/nvmem/Makefile
+++ b/drivers/nvmem/Makefile
@@ -23,6 +23,8 @@ obj-$(CONFIG_NVMEM_LPC18XX_OTP)   += nvmem_lpc18xx_otp.o
 nvmem_lpc18xx_otp-y:= lpc18xx_otp.o
 obj-$(CONFIG_NVMEM_MXS_OCOTP)  += nvmem-mxs-ocotp.o
 nvmem-mxs-ocotp-y  := mxs-ocotp.o
+obj-$(CONFIG_NVMEM_NINTENDO_OTP)   += nvmem-nintendo-otp.o
+nvmem-nintendo-otp-y   := nintendo-otp.o
 obj-$(CONFIG_MTK_EFUSE)+= nvmem_mtk-efuse.o
 nvmem_mtk-efuse-y  := mtk-efuse.o
 obj-$(CONFIG_QCOM_QFPROM)  += nvmem_qfprom.o
diff --git a/drivers/nvmem/nintendo-otp.c b/drivers/nvmem/nintendo-otp.c
new file mode 100644
index ..33961b17f9f1
--- /dev/null
+++ b/drivers/nvmem/nintendo-otp.c
@@ -0,0 +1,124 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Nintendo Wii and Wii U OTP driver
+ *
+ * This is a driver exposing the OTP of a Nintendo Wii or Wii U console.
+ *
+ * This memory contains common and per-console keys, signatures and
+ * related data required to access peripherals.
+ *
+ * Based on reversed documentation from https://wiiubrew.org/wiki/Hardware/OTP
+ *
+ * Copyright (C) 2021 Emmanuel Gil Peyrot 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define HW_OTPCMD  0
+#define HW_OTPDATA 4
+#define OTP_READ   0x8000
+#define BANK_SIZE  128
+#define WORD_SIZE  4
+
+struct nintendo_otp_priv {
+   void __iomem *regs;
+};
+
+struct nintendo_otp_devtype_data {
+   const char *name;
+   unsigned int num_banks;
+};
+
+static const struct nintendo_otp_devtype_data hollywood_otp_data = {
+   .name = "wii-otp",
+   .num_banks = 1,
+};
+
+static const struct nintendo_otp_devtype_data latte_otp_data = {
+   .name = "wiiu-otp",
+   .num_banks = 8,
+};
+
+static int nintendo_otp_reg_read(void *context,
+unsigned int reg, void *_val, size_t bytes)
+{
+   struct nintendo_otp_priv *priv = context;
+   u32 *val = _val;
+   int words = bytes / WORD_SIZE;
+   u32 bank, addr;
+
+   while (words--) {
+   bank = (reg / BANK_SIZE) << 8;
+   addr = (reg / WORD_SIZE) % (BANK_SIZE / WORD_SIZE);
+   iowrite32be(OTP_READ | bank | addr, priv->regs + HW_OTPCMD);
+   *val++ = ioread32be(priv->regs + HW_OTPDATA);
+   reg += WORD_SIZE;
+   }
+
+   return 0;
+}
+
+static const struct of_device_id nintendo_otp_of_table[] = {
+   { .compatible = "nintendo,hollywood-otp", .data = _otp_data },
+   { .compatible = "nintendo,latte-otp", .data = _otp_data },
+   {/* sentinel */},
+};
+MODULE_DEVICE_TABLE(of, nintendo_otp_of_table);
+
+static int nintendo_otp_probe(struct platform_device *pdev)
+{
+   struct device *dev = &pdev->dev;
+   const struct of_device_id *of_id =
+   of_match_device(nintendo_otp_of_table, dev);
+   struct resource *res;
+   struct nvmem_device *nvmem;
+   struct nintendo_otp_priv *priv;
+
+   struct nvmem_config config = {
+   .stride = WORD_SIZE,
+   .word_size = WORD_SIZE,
+   .reg_read = nintendo_otp_reg_read,
+   .read_only = true,
+   

[PATCH v3 0/5] nvmem: nintendo-otp: Add new driver for the Wii and Wii U OTP

2021-07-01 Thread Emmanuel Gil Peyrot
The OTP is a read-only memory area which contains various keys and
signatures used to decrypt, encrypt or verify various pieces of storage.

Its size depends on the console: it is 128 bytes on the Wii and
1024 bytes on the Wii U (split into eight 128-byte banks).

It can be used directly by writing into one register and reading from
the other one, without any additional synchronisation.

This series has been tested on both the Wii U (using my downstream
master-wiiu branch[1]), as well as on the Wii on mainline.

[1] https://gitlab.com/linkmauve/linux-wiiu/-/commits/master-wiiu

Changes since v1:
- Fixed the commit messages so they can be accepted by other email
  servers, sorry about that.

Changes since v2:
- Switched the dt binding documentation to YAML.
- Used more obvious register arithmetic, and tested that gcc (at -O1 and
  above) outputs the exact same rlwinm instructions for them.
- Use more #defines to make the code easier to read.
- Include some links to the reversed documentation.
- Avoid overlapping dt regions by changing the existing control@d800100
  node to end before the OTP registers, with some bigger dt refactoring
  left for a future series.

Emmanuel Gil Peyrot (5):
  nvmem: nintendo-otp: Add new driver for the Wii and Wii U OTP
  dt-bindings: nintendo-otp: Document the Wii and Wii U OTP support
  powerpc: wii.dts: Reduce the size of the control area
  powerpc: wii.dts: Expose the OTP on this platform
  powerpc: wii_defconfig: Enable OTP by default

 .../bindings/nvmem/nintendo-otp.yaml  |  44 +++
 arch/powerpc/boot/dts/wii.dts |  13 +-
 arch/powerpc/configs/wii_defconfig|   1 +
 drivers/nvmem/Kconfig |  11 ++
 drivers/nvmem/Makefile|   2 +
 drivers/nvmem/nintendo-otp.c  | 124 ++
 6 files changed, 194 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/devicetree/bindings/nvmem/nintendo-otp.yaml
 create mode 100644 drivers/nvmem/nintendo-otp.c

-- 
2.32.0



[PATCH v3 2/5] dt-bindings: nintendo-otp: Document the Wii and Wii U OTP support

2021-07-01 Thread Emmanuel Gil Peyrot
Both of these consoles use the exact same two registers, even at the
same address, but the Wii U has eight 128-byte banks of memory while
the Wii only has one, hence the two compatible strings.

Signed-off-by: Emmanuel Gil Peyrot 
---
 .../bindings/nvmem/nintendo-otp.yaml  | 44 +++
 1 file changed, 44 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/nvmem/nintendo-otp.yaml

diff --git a/Documentation/devicetree/bindings/nvmem/nintendo-otp.yaml 
b/Documentation/devicetree/bindings/nvmem/nintendo-otp.yaml
new file mode 100644
index ..c39bd64b03b9
--- /dev/null
+++ b/Documentation/devicetree/bindings/nvmem/nintendo-otp.yaml
@@ -0,0 +1,44 @@
+# SPDX-License-Identifier: GPL-2.0
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/nvmem/nintendo-otp.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Nintendo Wii and Wii U OTP Device Tree Bindings
+
+description: |
+  This binding represents the OTP memory as found on a Nintendo Wii or Wii U,
+  which contains common and per-console keys, signatures and related data
+  required to access peripherals.
+
+  See https://wiiubrew.org/wiki/Hardware/OTP
+
+maintainers:
+  - Emmanuel Gil Peyrot 
+
+allOf:
+  - $ref: "nvmem.yaml#"
+
+properties:
+  compatible:
+enum:
+  - nintendo,hollywood-otp
+  - nintendo,latte-otp
+
+  reg:
+maxItems: 1
+
+required:
+  - compatible
+  - reg
+
+unevaluatedProperties: false
+
+examples:
+  - |
+otp@d8001ec {
+compatible = "nintendo,latte-otp";
+reg = <0x0d8001ec 0x8>;
+};
+
+...
-- 
2.32.0



[PATCH v3 3/5] powerpc: wii.dts: Reduce the size of the control area

2021-07-01 Thread Emmanuel Gil Peyrot
This is wrong, but needed in order to avoid overlapping ranges with the
OTP area added in the next commit.  A refactor of this part of the
device tree is needed: according to Wiibrew[1], this area starts at
0x0d800000 and spans 0x400 bytes (that is, 0x100 32-bit registers),
encompassing PIC and GPIO registers, amongst the ones already exposed in
this device tree, which should become children of the control@d800000
node.

[1] https://wiibrew.org/wiki/Hardware/Hollywood_Registers

Signed-off-by: Emmanuel Gil Peyrot 
---
 arch/powerpc/boot/dts/wii.dts | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/boot/dts/wii.dts b/arch/powerpc/boot/dts/wii.dts
index aaa381da1906..c5fb54f8cc02 100644
--- a/arch/powerpc/boot/dts/wii.dts
+++ b/arch/powerpc/boot/dts/wii.dts
@@ -216,7 +216,13 @@ AVE: audio-video-encoder@70 {
 
control@d800100 {
compatible = "nintendo,hollywood-control";
-   reg = <0x0d800100 0x300>;
+   /*
+* Both the address and length are wrong, according to
+* Wiibrew this should be <0x0d800000 0x400>, but it
+* requires refactoring the PIC1 and GPIO nodes before
+* changing that.
+*/
+   reg = <0x0d800100 0xa0>;
};
 
disk@d806000 {
-- 
2.32.0



[PATCH v3 4/5] powerpc: wii.dts: Expose the OTP on this platform

2021-07-01 Thread Emmanuel Gil Peyrot
This can be used by the newly-added nintendo-otp nvmem module.

Signed-off-by: Emmanuel Gil Peyrot 
---
 arch/powerpc/boot/dts/wii.dts | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/boot/dts/wii.dts b/arch/powerpc/boot/dts/wii.dts
index c5fb54f8cc02..e9c945b123c6 100644
--- a/arch/powerpc/boot/dts/wii.dts
+++ b/arch/powerpc/boot/dts/wii.dts
@@ -219,12 +219,17 @@ control@d800100 {
/*
 * Both the address and length are wrong, according to
 * Wiibrew this should be <0x0d800000 0x400>, but it
-* requires refactoring the PIC1 and GPIO nodes before
-* changing that.
+* requires refactoring the PIC1, GPIO and OTP nodes
+* before changing that.
 */
reg = <0x0d800100 0xa0>;
};
 
+   otp@d8001ec {
+   compatible = "nintendo,hollywood-otp";
+   reg = <0x0d8001ec 0x8>;
+   };
+
disk@d806000 {
compatible = "nintendo,hollywood-di";
reg = <0x0d806000 0x40>;
-- 
2.32.0



[PATCH v3 5/5] powerpc: wii_defconfig: Enable OTP by default

2021-07-01 Thread Emmanuel Gil Peyrot
This selects the nintendo-otp module when building for this platform.

Signed-off-by: Emmanuel Gil Peyrot 
---
 arch/powerpc/configs/wii_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/configs/wii_defconfig 
b/arch/powerpc/configs/wii_defconfig
index 379c171f3ddd..a0c45bf2bfb1 100644
--- a/arch/powerpc/configs/wii_defconfig
+++ b/arch/powerpc/configs/wii_defconfig
@@ -99,6 +99,7 @@ CONFIG_LEDS_TRIGGER_HEARTBEAT=y
 CONFIG_LEDS_TRIGGER_PANIC=y
 CONFIG_RTC_CLASS=y
 CONFIG_RTC_DRV_GENERIC=y
+CONFIG_NVMEM_NINTENDO_OTP=y
 CONFIG_EXT2_FS=y
 CONFIG_EXT4_FS=y
 CONFIG_FUSE_FS=m
-- 
2.32.0



Re: [RFC PATCH 10/43] powerpc/64s: Always set PMU control registers to frozen/disabled when not in use

2021-07-01 Thread Nicholas Piggin
Excerpts from Madhavan Srinivasan's message of July 1, 2021 11:17 pm:
> 
> On 6/22/21 4:27 PM, Nicholas Piggin wrote:
>> KVM PMU management code looks for particular frozen/disabled bits in
>> the PMU registers so it knows whether it must clear them when coming
>> out of a guest or not. Setting this up helps KVM make these optimisations
>> without getting confused. Longer term the better approach might be to
>> move guest/host PMU switching to the perf subsystem.
>>
>> Signed-off-by: Nicholas Piggin 
>> ---
>>   arch/powerpc/kernel/cpu_setup_power.c | 4 ++--
>>   arch/powerpc/kernel/dt_cpu_ftrs.c | 6 +++---
>>   arch/powerpc/kvm/book3s_hv.c  | 5 +
>>   arch/powerpc/perf/core-book3s.c   | 7 +++
>>   4 files changed, 17 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/cpu_setup_power.c 
>> b/arch/powerpc/kernel/cpu_setup_power.c
>> index a29dc8326622..3dc61e203f37 100644
>> --- a/arch/powerpc/kernel/cpu_setup_power.c
>> +++ b/arch/powerpc/kernel/cpu_setup_power.c
>> @@ -109,7 +109,7 @@ static void init_PMU_HV_ISA207(void)
>>   static void init_PMU(void)
>>   {
>>  mtspr(SPRN_MMCRA, 0);
>> -mtspr(SPRN_MMCR0, 0);
>> +mtspr(SPRN_MMCR0, MMCR0_FC);
> 
> The sticky point here is that currently, if not frozen, pmc5/6 will
> keep counting. And not freezing them at boot is quite useful
> sometimes, like say when running in a simulation where we could
> calculate approximate CPIs for micro-benchmarks without the perf subsystem.
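
(For context, the measurement described above relies on PMC5 and PMC6,
which on these processors count instructions completed and cycles
respectively, and keep running as long as they are not frozen. A rough
sketch of the idea -- a hypothetical helper assuming kernel context and
the SPR definitions from asm/reg.h, not a definitive implementation:

	#include <asm/reg.h>

	/* Approximate CPI (scaled by 100) from the free-running
	 * counters: PMC5 counts instructions completed, PMC6 counts
	 * cycles.
	 */
	static unsigned long sample_cpi_x100(void)
	{
		unsigned long insns  = mfspr(SPRN_PMC5);
		unsigned long cycles = mfspr(SPRN_PMC6);

		return insns ? (cycles * 100) / insns : 0;
	}

With MMCR0[FC] set at init, as this patch does, both counters stay
frozen until something explicitly unfreezes them.)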

Can you not even use the sysfs files in this sim environment? If not,
what if we added a boot option that could set some things up? That way
you could possibly even gather some more types of events too.

Thanks,
Nick