Donet Tom <donet...@linux.ibm.com> writes:

> On systems using the hash MMU, there is a software SLB preload cache that
> mirrors the entries loaded into the hardware SLB buffer. This preload
> cache is subject to periodic eviction — typically after every 256 context
> switches — to remove old entries.
>
> Currently, the kernel skips the MMU context switch in switch_mm_irqs_off()
> if the prev and next mm_struct are the same, as an optimization. However,
> this behavior can lead to problems on hash MMU systems.
>
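For anyone following along, the aging path being described works roughly as
below (a paraphrased sketch of arch/powerpc/mm/book3s64/slb.c, not the exact
upstream code; load_slb is a u8, so the increment wraps every 256 context
switches):

	/* drop the oldest entry from the FIFO preload cache */
	static void preload_age(struct thread_info *ti)
	{
		if (!ti->slb_preload_nr)
			return;
		ti->slb_preload_nr--;
		ti->slb_preload_tail = (ti->slb_preload_tail + 1) % SLB_PRELOAD_NR;
	}

	/* in switch_slb(): */
	tsk->thread.load_slb++;
	if (!tsk->thread.load_slb)	/* u8 wrapped to 0: 256th switch */
		preload_age(ti);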
Let's also add the detailed flow of events, as this was not really an easy
problem to catch:

CPU 0                                   CPU 1
-----                                   -----
Process P                               swapper/1
exec
load_elf_binary
begin_new_exec
activate_mm
switch_mm_irqs_off
switch_mmu_context
switch_slb
/*
 * This invalidates all the entries
 * in the HW and sets up the new HW
 * SLB entries as per the preload
 * cache.
 */
sched_switch
sched_migrate_task migrates process P to cpu-1

Process swapper/0                       context switch (to process P)
                                        (uses mm_struct of Process P)
                                        switch_mm_irqs_off()
                                        switch_slb
                                        load_slb++
                                        /*
                                         * load_slb becomes 0 here and we
                                         * evict an entry from the preload
                                         * cache with preload_age(). We
                                         * still keep the HW SLB and the
                                         * preload cache in sync, because
                                         * all HW SLB entries get evicted
                                         * anyway in switch_slb during
                                         * SLBIA. We then only add back
                                         * into the HW SLB those entries
                                         * which are currently present in
                                         * the preload cache (after
                                         * eviction).
                                         */
                                        load_elf_binary continues...
                                        setup_new_exec()
                                        slb_setup_new_exec()

                                        sched_switch event
                                        sched_migrate_task migrates process P
                                        to cpu-0

context_switch from swapper/0 to Process P
switch_mm_irqs_off()
/*
 * Since both the prev and next mm_struct are the same, we don't call
 * switch_mmu_context(). This causes the HW SLB and the SW preload cache
 * to go out of sync in preload_new_slb_context, because there was an SLB
 * entry which was evicted from both the HW SLB and the preload cache on
 * cpu-1. Later, in preload_new_slb_context(), when we try to add the same
 * preload entry again, we add it to the SW preload cache and then to the
 * HW SLB. Since on cpu-0 this entry was never invalidated, adding it to
 * the HW SLB causes an SLB multi-hit error.
 */
load_elf_binary continues...
START_THREAD
start_thread
preload_new_slb_context
/*
 * This tries to add a new EA to the preload cache which was earlier
 * evicted from both cpu-1's HW SLB and preload cache. This caused the
 * HW SLB of cpu-0 to go out of sync with the SW preload cache. The
 * reason is that when we context switched back on cpu-0, we should
 * ideally have called switch_mmu_context(), which would bring the HW SLB
 * entries on cpu-0 in sync with the SW preload cache entries by setting
 * up the mmu context properly. But we didn't do that, since the prev
 * mm_struct running on cpu-0 was the same as the next mm_struct (which
 * is true for swapper / kernel threads). So now, when we try to add this
 * new entry into the HW SLB of cpu-0, we hit an SLB multi-hit error.
 */

WARNING: CPU: 0 PID: 1810970 at arch/powerpc/mm/book3s64/slb.c:62 assert_slb_presence+0x2c/0x50
Modules linked in:
CPU: 0 UID: 0 PID: 1810970 Comm: dd Not tainted 6.16.0-rc3-dirty #12 VOLUNTARY
Hardware name: IBM pSeries (emulated by qemu) POWER8 (architected) 0x4d0200 0xf000004 of:SLOF,HEAD hv:linux,kvm pSeries
NIP:  c00000000015426c LR: c0000000001543b4 CTR: 0000000000000000
REGS: c0000000497c77e0 TRAP: 0700   Not tainted  (6.16.0-rc3-dirty)
MSR:  8000000002823033 <SF,VEC,VSX,FP,ME,IR,DR,RI,LE>  CR: 28888482  XER: 00000000
CFAR: c0000000001543b0 IRQMASK: 3
<...>
NIP [c00000000015426c] assert_slb_presence+0x2c/0x50
LR [c0000000001543b4] slb_insert_entry+0x124/0x390
Call Trace:
0x7fffceb5ffff (unreliable)
preload_new_slb_context+0x100/0x1a0
start_thread+0x26c/0x420
load_elf_binary+0x1b04/0x1c40
bprm_execve+0x358/0x680
do_execveat_common+0x1f8/0x240
sys_execve+0x58/0x70
system_call_exception+0x114/0x300
system_call_common+0x160/0x2c4
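The WARN above is from assert_slb_presence(), which on CONFIG_DEBUG_VM builds
probes the HW SLB before slb_insert_entry() writes a new entry (a paraphrased
sketch, not the exact upstream code):

	static void assert_slb_presence(bool present, unsigned long ea)
	{
		unsigned long tmp;

		/* slbfee. yields a non-zero result when a valid SLB entry
		 * matches ea; here the EA is expected to be absent, so the
		 * stale entry left behind on cpu-0 trips the WARN instead
		 * of silently setting up a multi-hit.
		 */
		asm volatile(__PPC_SLBFEE_DOT(%0, %1) : "=r"(tmp) : "r"(ea) : "cr0");
		WARN_ON(present == (tmp == 0));
	}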
> Consider the following scenario: a process is running on CPU A and gets
> context-switched to CPU B. During this time, one of its SLB preload cache
> entries is evicted. Later, the process is rescheduled on CPU A, which was
> running swapper in the meantime, using the same mm_struct. Because
> prev == next, the kernel skips the MMU context switch. As a result, the
> hardware SLB buffer still contains the entry, but the software preload
> cache does not.
>
> The absence of the entry in the preload cache causes it to attempt to
> reload the SLB. However, since the entry is already present in the
> hardware SLB, this leads to an SLB multi-hit error.
>

Can we use the detailed commit msg from above instead of the above two
paragraphs? It is easier to visualize and document if we have it that way.

> To fix this issue, we add a code change to always switch the MMU context
> on hash MMU if the SLB preload cache has aged. With this change, the
> SLB multi-hit error no longer occurs.
>
> Fixes: 5434ae74629a ("powerpc/64s/hash: Add a SLB preload cache")

CC: sta...@vger.kernel.org

Otherwise LGTM.

> Suggested-by: Ritesh Harjani (IBM) <ritesh.l...@gmail.com>
> Signed-off-by: Donet Tom <donet...@linux.ibm.com>
> ---
>  arch/powerpc/mm/book3s64/slb.c | 2 +-
>  arch/powerpc/mm/mmu_context.c  | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
> index 6b783552403c..08daac3f978c 100644
> --- a/arch/powerpc/mm/book3s64/slb.c
> +++ b/arch/powerpc/mm/book3s64/slb.c
> @@ -509,7 +509,7 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
>  	 * SLB preload cache.
>  	 */
>  	tsk->thread.load_slb++;
> -	if (!tsk->thread.load_slb) {
> +	if (tsk->thread.load_slb == U8_MAX) {
>  		unsigned long pc = KSTK_EIP(tsk);
>  
>  		preload_age(ti);
> diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
> index 3e3af29b4523..d7b9ac8c9971 100644
> --- a/arch/powerpc/mm/mmu_context.c
> +++ b/arch/powerpc/mm/mmu_context.c
> @@ -84,7 +84,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  	switch_mm_pgdir(tsk, next);
>  
>  	/* Nothing else to do if we aren't actually switching */
> -	if (prev == next)
> +	if ((prev == next) && (tsk->thread.load_slb != U8_MAX))
>  		return;
>  
>  	/*
> --
> 2.50.1
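To spell out my reading of how the two hunks interact (an illustrative
condensation of the diff above, not new code):

	/* switch_slb(): age the preload cache once per 256 switches,
	 * now when the u8 counter saturates at U8_MAX rather than when
	 * it wraps to 0.
	 */
	tsk->thread.load_slb++;
	if (tsk->thread.load_slb == U8_MAX)
		preload_age(ti);

	/* switch_mm_irqs_off(): the prev == next fast path is only taken
	 * while no aging is pending; at load_slb == U8_MAX we fall
	 * through to switch_mmu_context(), whose SLBIA in switch_slb()
	 * brings the HW SLB back in sync with the aged preload cache.
	 */
	if ((prev == next) && (tsk->thread.load_slb != U8_MAX))
		return;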