An update: it looks like the same issue was observed in Red Hat 7 kernels, which are also based on 3.10. It concerns the perf_event_overflow error seen with an increased kernel.watchdog_thresh:
https://access.redhat.com/solutions/1354963

```
* Red Hat Enterprise Linux (RHEL) 7
* seen on several versions of the RHEL7 kernel (3.10.0-version.el7.x86_64)
* the /proc/sys/kernel/watchdog_thresh parameter is set to a higher value than the default
* Docker
```

They report the panic on Docker; we see it on a normal application workload (but HPC applications are long-running and use a lot of memory, so they can be somewhat similar to a heavily used container).

The Red Hat solution basically suggests updating to their later kernel. What would one do with the ELRepo one?

-- 
Grigory Shamov
Westgrid/ComputeCanada Site Lead
University of Manitoba
E2-588 EITC Building,
(204) 474-9625

On 2016-10-06, 11:31 AM, "elrepo-boun...@lists.elrepo.org on behalf of Grigory Shamov" <elrepo-boun...@lists.elrepo.org on behalf of grigory.sha...@umanitoba.ca> wrote:

>Hi All,
>
>We are running kernel-lt-3.10.103 on about 300 CentOS 6.8 machines of our
>HPC cluster.
>The machines are fairly old Intel Xeon X5650s (Westmere/Nehalem chips,
>SSE4.2).
>We first tested whether the kernel works with our driver stack, were
>satisfied, and went to production.
>
>It turned out, though, that under production load, from time to time, on
>some of the nodes (a few of them, seemingly at random), the kernel panics
>on nmi_watchdog hard lockups (and from time to time barfs about soft
>lockups), emitting various messages like this:
>
>"""
>Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
> ... Call trace follows; mentions watchdog_overflow_callback ...
>Shutting down cpus with NMI
>drm_kms_helper: panic occurred, switching back to text console
>"""
>
>Then we tried simply increasing kernel.watchdog_thresh; on 3.10 it
>is set to 10, while on the CentOS 6 2.6.32 kernel it used to be 60.
>It made things worse: the test node quickly had a kernel panic with a call
>trace mentioning "perf_event_overflow".
>
>Is there anything we can do about these errors, and what would be the
>possible reason for them?
>Could anyone suggest a fix? Thank you very much
>in advance.
>
>--
>Grigory Shamov
>
>Westgrid/ComputeCanada Site Lead
>University of Manitoba
>E2-588 EITC Building,
>(204) 474-9625
>
>_______________________________________________
>elrepo mailing list
>elrepo@lists.elrepo.org
>http://lists.elrepo.org/mailman/listinfo/elrepo
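
A minimal sketch, not part of the original thread, for checking whether a node matches the Red Hat condition of a raised watchdog_thresh. It reads /proc/sys/kernel/watchdog_thresh and compares it with a default of 10 seconds, which is an assumption taken from the value quoted above for the 3.10 kernel. The function and constant names are hypothetical.

```python
from pathlib import Path

# Assumption: 10 seconds is the default on 3.10 kernels, as quoted in the message above.
DEFAULT_THRESH = 10
THRESH_PATH = Path("/proc/sys/kernel/watchdog_thresh")


def parse_thresh(text):
    """Turn the contents of the watchdog_thresh file into an integer (seconds)."""
    return int(text.strip())


def is_raised(thresh, default=DEFAULT_THRESH):
    """Return True if the threshold is higher than the assumed default."""
    return thresh > default


def main():
    try:
        thresh = parse_thresh(THRESH_PATH.read_text())
    except (OSError, ValueError) as exc:
        # The file is missing on non-Linux systems or kernels without the watchdog.
        print(f"could not read {THRESH_PATH}: {exc}")
        return
    state = "raised above" if is_raised(thresh) else "at or below"
    print(f"watchdog_thresh={thresh} ({state} the assumed default of {DEFAULT_THRESH})")


if __name__ == "__main__":
    main()
```

Run on each node (for example through the cluster's parallel shell), this would only show which machines have a non-default threshold; it does not diagnose the lockups themselves.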