On Fri, Dec 26, 2025 at 9:40 AM Mark Kettenis <[email protected]> wrote:
>
> > Date: Fri, 26 Dec 2025 10:44:49 +0100
> > From: Mark Kettenis <[email protected]>
> >
> > > From: K R <[email protected]>
> > > Date: Thu, 25 Dec 2025 14:45:54 -0300
> > >
> > > On Thu, Dec 25, 2025 at 2:11 PM Mark Kettenis <[email protected]> 
> > > wrote:
> > > >
> > > > > From: K R <[email protected]>
> > > > > Date: Wed, 24 Dec 2025 10:43:34 -0300
> > > > >
> > > > > On Tue, Dec 23, 2025 at 6:07 PM Miod Vallat <[email protected]> wrote:
> > > > > >
> > > > > > > Hi, thanks for the diff -- I've just applied it.  According to
> > > > > > > KASSERT(9), DIAGNOSTIC must be enabled in the kernel.  Could you
> > > > > > > confirm that?
> > > > > >
> > > > > > DIAGNOSTIC is enabled in GENERIC and GENERIC.MP kernels.
> > > > >
> > > > > You're right, of course.  I've applied your patch  -- the machine hung
> > > > > building lang/rust, but I got a protection fault again.  The ddb
> > > > > session is still on, in case you need symbols examined.
> > > >
> > > > Can you send us the dmesg for this machine?
> > >
> > > Sure.
> >
> > Thanks.  Can you try the diff below?
>
> So the elaborate on what's going on here: what our TLB flushing code
> is doing is (for a range flush of [sva, eva)):
>
>     tlb_shoot_addr1 = sva;
>     tlb_shoot_addr2 = eva;
>     send_ipi()
>
> where tlb_shoot_addr1 and tlb_shoot_addr2 are global variables.
>
> The IPI handler (running on a different CPU) then does:
>
>     sva = tlb_shoot_addr1;
>     eva = tlb_shoot_addr2;
>     for (va = sva; va < eva; va += PAGE_SIZE)
>         invlpg(va); /* Flush TLB for page */
>
> Now if we're using the x2APIC, send_ipi() is effectively a write to an
> x2APIC MSR.  Normally a write to an MSRs is a serializing instruction,
> which would guarantee that the new values for tlb_shoot_addr1 and
> tlb_shoot_addr2 are visible to the other CPUs.  However, Intel
> explicitly documents writes to the x2APIC as non-serializing
> instructions.  That means they can execute out-of-order.  Which means
> that the first pseudocode fragment could execute as:
>
>    send_ipi();
>    tlb_shoot_addr1 = sva;
>    tlb_shoot_addr2 = eva;
>
> And that means the IPI handler on the other CPU might see the *old*
> values for tlb_shoot_addr1 and tlb_shoot_addr2.  Linux does an
> "mfence; lfence" before the MSR write that sends an IPI.  I learned
> all of this from the following stackoverflow:
>
>   
> https://stackoverflow.com/questions/76352933/will-memory-write-be-visible-after-sending-an-ipi-on-x86
>
> There may be better sources though.
>
> Note that on AMD CPUs the x2APIC MSR writes are serializing
> instructions.  Also note that we don't use the x2APIC on OpenBSD
> unless the BIOS has it enabled already at boot (or if we're running
> under a hypervisor).  That is probably why we haven't seen this issue
> pop up a lot.  My suspicion is that this Dell R440 has the x2APIC at
> boot.
>
> Here is an updated version of the diff that prints the x2APIC state
> such that we can verify this hypothesis.

Hypothesis verified:

test# dmesg | grep x2apic
lapic_map: x2apic_enabled 1

The machine is alive so far with your lapic diff applied -- no hangs!
Already survived 3 rounds of lang/rust builds, normally it would hang
in the middle of the first one.  I'll keep you posted.  Thanks!

>
>
> Index: arch/amd64/amd64/lapic.c
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/amd64/lapic.c,v
> diff -u -p -r1.76 lapic.c
> --- arch/amd64/amd64/lapic.c    12 Nov 2025 09:48:52 -0000      1.76
> +++ arch/amd64/amd64/lapic.c    26 Dec 2025 12:40:06 -0000
> @@ -235,6 +235,8 @@ lapic_map(paddr_t lapic_base)
>             (uint64_t)va, (uint64_t)lapic_base);
>
>         intr_restore(s);
> +
> +       printf("%s: x2apic_enabled %d\n", __func__, x2apic_enabled);
>  }
>
>  /*
> @@ -691,6 +693,7 @@ x2apic_ipi(int vec, int target, int dl)
>
>         lo = (target & LAPIC_DEST_MASK) | vec | dl | LAPIC_LVL_ASSERT;
>
> +       __asm volatile("mfence; lfence" ::: "memory");
>         x2apic_writeicr(hi, lo);
>  }
>

Reply via email to