Re: Kernel panics after some hours of use (likely related to modeset)

2018-01-08 Thread Jonathan Gray
On Mon, Jan 08, 2018 at 05:20:39PM -0800, Mike Larkin wrote:
> On Tue, Jan 09, 2018 at 12:44:04AM +0100, azarus wrote:
> > To: bugs@openbsd.org
> > Subject: Kernel panics after some hours of use (likely related to modeset)
> > From: aza...@posteo.net
> > Cc: aza...@posteo.net
> > Reply-To: aza...@posteo.net
> > 
> > >Synopsis:  The kernel panics reproducibly after a couple of hours of use (2-4 hours)
> > >Category:  system amd64 kernel
> > >Environment:
> > System  : OpenBSD 6.2
> > Details : OpenBSD 6.2-current (GENERIC.MP) #333: Sun Jan  7 09:13:00 MST 2018
> >  dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> > 
> > Architecture: OpenBSD.amd64
> > Machine : amd64
> > >Description:
> > In snapshots #320-#333 (every second snapshot or so tested) the kernel
> > hangs reproducibly after some hours of use. During use I have a pdf
> > viewer (mupdf), a browser (Firefox), tmux, an editor (nvim), a music
> > player (mpd) and some shells open (zsh).
> > 
> > This issue happens often when I leave the computer for some minutes, so
> > it might be something related to the screen turning off (modeset).
> > 
> > This might not be relevant, but I tried both with softdep enabled and
> > disabled, to the same result.
> > 
> > The machine is a ThinkPad X230, with Coreboot. (But I doubt it's
> > coreboot causing the issue, as the computer's not going to sleep)
> > 
> > I cannot provide a dmesg of the crashed system, as "boot dump" fails.
> > 
> > For the complete kernel error message, trace output, show registers
> > output and ps output, please see the attached pictures.
> > 
> > >How-To-Repeat:
> > 1. Use machine for a couple of hours
> > 2. Leave machine for some time (5-15 minutes)
> > 3. Kernel panics with "uvm_fault(0xfff81b4b158, 0x0, 0, 1) -> e"
> > >Fix:
> > unknown
> > 
> 
> A few of us have been seeing this, so we know about the issue. There is
> no fix at this time however. Thanks for reporting it though.

This is the workaround I have in my tree to avoid the NULL deref.

Index: sys/dev/pci/drm/linux_ww_mutex.h
===================================================================
RCS file: /cvs/src/sys/dev/pci/drm/linux_ww_mutex.h,v
retrieving revision 1.1
diff -u -p -r1.1 linux_ww_mutex.h
--- sys/dev/pci/drm/linux_ww_mutex.h	1 Jul 2017 16:14:10 -   1.1
+++ sys/dev/pci/drm/linux_ww_mutex.h	13 Aug 2017 06:40:35 -
@@ -163,7 +163,8 @@ __ww_mutex_lock(struct ww_mutex *lock, s
  *   the `younger` process gives up all it's
  *   resources.
 */
-   if (slow || ctx == NULL || ctx->stamp < lock->ctx->stamp) {
+   if (slow || ctx == NULL ||
+       (lock->ctx != NULL && ctx->stamp < lock->ctx->stamp)) {
        int s = msleep(lock, &lock->lock,
            intr ? PCATCH : 0,
            ctx ? ctx->ww_class->name : "ww_mutex_lock", 0);

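For readers of the archive, the effect of the added check can be sketched in isolation. This is a minimal user-space illustration, not the kernel code: `struct acquire_ctx`, `struct ww_lock` and `should_sleep()` are hypothetical, simplified stand-ins, and only the boolean condition is taken from the diff above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct acquire_ctx {
	unsigned long stamp;		/* lower stamp = older context */
};

struct ww_lock {
	struct acquire_ctx *ctx;	/* may be NULL */
};

/*
 * Mirrors the patched condition: the stamp comparison is only evaluated
 * when lock->ctx is non-NULL, so a NULL lock->ctx is never dereferenced.
 */
static bool
should_sleep(const struct ww_lock *lock, const struct acquire_ctx *ctx,
    bool slow)
{
	return slow || ctx == NULL ||
	    (lock->ctx != NULL && ctx->stamp < lock->ctx->stamp);
}
```

With `lock->ctx == NULL` and a non-NULL `ctx`, the guarded expression short-circuits before `lock->ctx->stamp` is read. The unpatched condition would have dereferenced the NULL pointer at that point.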


Re: Kernel panics after some hours of use (likely related to modeset)

2018-01-08 Thread Mike Larkin
On Tue, Jan 09, 2018 at 12:44:04AM +0100, azarus wrote:
> To: bugs@openbsd.org
> Subject: Kernel panics after some hours of use (likely related to modeset)
> From: aza...@posteo.net
> Cc: aza...@posteo.net
> Reply-To: aza...@posteo.net
> 
> >Synopsis:  The kernel panics reproducibly after a couple of hours of use (2-4 hours)
> >Category:  system amd64 kernel
> >Environment:
>   System  : OpenBSD 6.2
>   Details : OpenBSD 6.2-current (GENERIC.MP) #333: Sun Jan  7 09:13:00 MST 2018
>    dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> 
>   Architecture: OpenBSD.amd64
>   Machine : amd64
> >Description:
> In snapshots #320-#333 (every second snapshot or so tested) the kernel
> hangs reproducibly after some hours of use. During use I have a pdf
> viewer (mupdf), a browser (Firefox), tmux, an editor (nvim), a music
> player (mpd) and some shells open (zsh).
> 
> This issue happens often when I leave the computer for some minutes, so
> it might be something related to the screen turning off (modeset).
> 
> This might not be relevant, but I tried both with softdep enabled and
> disabled, to the same result.
> 
> The machine is a ThinkPad X230, with Coreboot. (But I doubt it's
> coreboot causing the issue, as the computer's not going to sleep)
> 
> I cannot provide a dmesg of the crashed system, as "boot dump" fails.
> 
> For the complete kernel error message, trace output, show registers
> output and ps output, please see the attached pictures.
> 
> >How-To-Repeat:
> 1. Use machine for a couple of hours
> 2. Leave machine for some time (5-15 minutes)
> 3. Kernel panics with "uvm_fault(0xfff81b4b158, 0x0, 0, 1) -> e"
> >Fix:
> unknown
> 

A few of us have been seeing this, so we know about the issue. There is
no fix at this time however. Thanks for reporting it though.

-ml

> dmesg:
> OpenBSD 6.2-current (GENERIC.MP) #333: Sun Jan  7 09:13:00 MST 2018
> dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> real mem = 8494600192 (8101MB)
> avail mem = 8230227968 (7848MB)
> mpath0 at root
> scsibus0 at mpath0: 256 targets
> mainbus0 at root
> bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xbff28020 (10 entries)
> bios0: vendor coreboot version "CBET4000 4.6-196-g0fb6568" date 05/22/2017
> bios0: LENOVO 2325YBN
> acpi0 at bios0: rev 2
> acpi0: sleep states S0 S3 S4 S5
> acpi0: tables DSDT FACP SSDT MCFG TCPA APIC DMAR HPET
> acpi0: wakeup devices HDEF(S4) EHC1(S4) EHC2(S4) XHC_(S4) SLPB(S3) LID_(S3)
> acpitimer0 at acpi0: 3579545 Hz, 24 bits
> acpimcfg0 at acpi0 addr 0xf800, bus 0-63
> acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
> cpu0 at mainbus0: apid 0 (boot processor)
> cpu0: Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz, 2594.53 MHz
> cpu0: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,SENSOR,ARAT
> cpu0: 256KB 64b/line 8-way L2 cache
> acpitimer0: recalibrated TSC frequency 2594107462 Hz
> cpu0: smt 0, core 0, package 0
> mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
> cpu0: apic clock running at 99MHz
> cpu0: mwait min=64, max=64, C-substates=0.2.1.1.2, IBE
> cpu1 at mainbus0: apid 1 (application processor)
> cpu1: Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz, 2594.12 MHz
> cpu1: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,SENSOR,ARAT
> cpu1: 256KB 64b/line 8-way L2 cache
> cpu1: smt 1, core 0, package 0
> cpu2 at mainbus0: apid 2 (application processor)
> cpu2: Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz, 2594.12 MHz
> cpu2: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,SENSOR,ARAT
> cpu2: 256KB 64b/line 8-way L2 cache
> cpu2: smt 0, core 1, package 0
> cpu3 at mainbus0: apid 3 (application processor)
> cpu3: Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz, 2594.12 MHz
> cpu3: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,SENSOR,ARAT
> cpu3: 256KB 64b/line 8-way L2 cache
> cpu3: smt 1, core 1, package 0
> ioapic0 at mainbus0: apid 2 pa 0xfec0, version 20, 

Re: ktrace firefox freeze my box

2018-01-08 Thread Martin Pieuchot
On 03/01/18(Wed) 10:36, Martin Pieuchot wrote:
> On 02/01/18(Tue) 12:48, Ted Unangst wrote:
> > Martin Pieuchot wrote:
> > > on -current amd64, simply doing "$ ktrace -p $pid_of_firefox" is enough
> > > to freeze my box:
> > 
> > ktrace of chrome would do that going back at least six months.
> 
> And nobody reported the bug!  Now it's harder to track it down :/

For the archives: it's now fixed.  It was a bug of mine, present since
the import of the syscall.