On Thu, Jan 28, 2021 at 12:21:26PM +1100, Jonathan Gray wrote:
> On Wed, Jan 27, 2021 at 07:11:49AM +0100, alf wrote:
> > Hello,
> > 
> > while trying to upgrade one of our machines to 6.8 we experienced a
> > repeatable crash while booting (bsd.rd + install went fine).
> > 
> > The machine in question is a:
> > ...
> > hw.vendor=HP
> > hw.product=ProLiant DL360 G7
> > hw.serialno=CZ3451KJW6
> > hw.uuid=36333337-3738-435a-3334-35314b4a5736
> > hw.physmem=8562860032
> > hw.usermem=8562847744
> > hw.ncpufound=12
> > hw.allowpowerdown=1
> > hw.perfpolicy=manual
> > hw.smt=0
> > hw.ncpuonline=6
> > ...
> > 
> > Since this is a production machine we downgraded to 6.7 (upgrade from
> > 6.6 which it was running before went flawlessly).
> > 
> > Find below the dmesg of the 6.8 kernel, 6.8-current and finally the
> > 6.7 kernel. For the 6.8* I also provided 'trace' and 'show registers'
> > output.
> > 
> > I hope this is enough info to get an idea of what was going on.
> > I'll happily provide additional info if needed.
> > 
> > Alf
> > 
> > cpu0: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz, 2667.08 MHz, 06-2c-02
> > cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,POPCNT,AES,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,MELTDOWN
> 
> > initializing kernel modesetting (RV100 0x1002:0x515E 0x103C:0x31FB 0x02).
> > NMI ... going to debugger
> > Stopped at      tsc_delay+0x63: lfence
> > ddb{0}> trace
> > tsc_delay(1) at tsc_delay+0x63
> > r100_ring_test(ffff8000001a4000,ffff8000001a5858) at r100_ring_test+0x277
> > r100_cp_init(ffff8000001a4000,100000) at r100_cp_init+0x5a1
> > r100_startup(ffff8000001a4000) at r100_startup+0x535
> > r100_init(ffff8000001a4000) at r100_init+0x4ac
> > radeon_device_init(ffff8000001a4000,ffff800000196800,ffff800000196840,840001) at radeon_device_init+0x944
> > radeondrm_attachhook(ffff8000001a4000) at radeondrm_attachhook+0x36
> > config_process_deferred_mountroot() at config_process_deferred_mountroot+0x6b
> > main(0) at main+0x723
> > end trace frame: 0x0, count: -9
> 
> I don't understand why an lfence would cause an nmi.
> 
> Does it still occur with the below diff to change lfence;rdtsc to rdtscp?
> This requires RDTSCP which your machine has but bluhm's machine does not.
> 
> Perhaps it is related to some kind of watchdog timer?  Can you check if
> the ilo event log has any relevant information?

Checked the iLO event log; it didn't provide any relevant info. It has
been complaining for ages about not being able to talk to the NTP
server, but I doubt that has anything to do with this.

Alf

> 
> Index: sys/arch/amd64/include/cpufunc.h
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/include/cpufunc.h,v
> retrieving revision 1.36
> diff -u -p -r1.36 cpufunc.h
> --- sys/arch/amd64/include/cpufunc.h  13 Sep 2020 11:53:16 -0000      1.36
> +++ sys/arch/amd64/include/cpufunc.h  28 Jan 2021 00:47:16 -0000
> @@ -307,7 +307,8 @@ rdtsc_lfence(void)
>  {
>       uint32_t hi, lo;
>  
> -     __asm volatile("lfence; rdtsc" : "=d" (hi), "=a" (lo));
> +//   __asm volatile("lfence; rdtsc" : "=d" (hi), "=a" (lo));
> +     __asm volatile("rdtscp" : "=d" (hi), "=a" (lo) :: "ecx");
>       return (((uint64_t)hi << 32) | (uint64_t) lo);
>  }
>  
> 
> 
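
For anyone who wants to try the two TSC reads from the diff outside the
kernel, here is a minimal userland sketch. It is not the kernel code. It
assumes GCC or Clang on x86 with <cpuid.h> available. rdtsc_lfence()
mirrors the function in cpufunc.h; has_rdtscp(), rdtscp_read() and
read_tsc() are hypothetical names made up for this example. The RDTSCP
feature flag is bit 27 of EDX in CPUID leaf 0x80000001. RDTSCP also
writes IA32_TSC_AUX into ECX, which is why the diff lists "ecx" as a
clobber.

```c
#include <cpuid.h>
#include <stdint.h>

/* Returns 1 if the CPU advertises RDTSCP (CPUID 0x80000001, EDX bit 27). */
static int
has_rdtscp(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid() returns 0 when the requested leaf is unsupported. */
	if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
		return (0);
	return ((edx >> 27) & 1);
}

/* Original variant: fence earlier instructions, then read the TSC. */
static inline uint64_t
rdtsc_lfence(void)
{
	uint32_t hi, lo;

	__asm volatile("lfence; rdtsc" : "=d" (hi), "=a" (lo));
	return (((uint64_t)hi << 32) | (uint64_t)lo);
}

/* Variant from the diff: RDTSCP waits for earlier instructions itself. */
static inline uint64_t
rdtscp_read(void)
{
	uint32_t hi, lo;

	/* ECX receives IA32_TSC_AUX, so it must be listed as clobbered. */
	__asm volatile("rdtscp" : "=d" (hi), "=a" (lo) : : "ecx");
	return (((uint64_t)hi << 32) | (uint64_t)lo);
}

/* Use RDTSCP only where it exists, otherwise fall back to lfence;rdtsc. */
static uint64_t
read_tsc(void)
{
	return (has_rdtscp() ? rdtscp_read() : rdtsc_lfence());
}
```

Calling read_tsc() picks RDTSCP on a machine like the DL360 G7 above
(RDTSCP appears in its cpu0 feature list) and falls back to
lfence;rdtsc on a CPU without it.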
