An update:

It looks like the same issue has been observed in Red Hat 7 kernels, which
are also based on 3.10. It pertains to the perf_event_overflow error seen
with an increased kernel.watchdog_thresh:


https://access.redhat.com/solutions/1354963

```
* Red Hat Enterprise Linux (RHEL) 7
* seen on several versions of the RHEL7 kernel (3.10.0-version.el7.x86_64)
* the /proc/sys/kernel/watchdog_thresh parameter is set to a higher value
than the default
* Docker
```

They report the panic with Docker; we see it under a normal application
workload (but HPC applications are long-running and use a lot of memory, so
they can be somewhat similar to a heavily used container).
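
To see which nodes match the kernel and threshold conditions quoted from the
Red Hat article, something like the following could be run on each node. This
is only a minimal sketch: the helper names are made up for illustration, and
the default of 10 is the value we see on 3.10 (see the original message below).

```python
import platform

# Default watchdog_thresh on 3.10 kernels, per the original message below.
DEFAULT_WATCHDOG_THRESH = 10

def parse_thresh(text):
    """Parse the contents of /proc/sys/kernel/watchdog_thresh (seconds)."""
    return int(text.strip())

def read_thresh(path="/proc/sys/kernel/watchdog_thresh"):
    """Read the current threshold from procfs (Linux only)."""
    with open(path) as f:
        return parse_thresh(f.read())

def matches_report(kernel_release, thresh, default=DEFAULT_WATCHDOG_THRESH):
    """True if the node runs a 3.10-based kernel with a raised threshold."""
    return kernel_release.startswith("3.10.") and thresh > default

def check_this_node():
    """Return (release, thresh, matches) for the machine this runs on."""
    release = platform.release()
    thresh = read_thresh()
    return release, thresh, matches_report(release, thresh)
```

Calling check_this_node() on a node returns its kernel release, the current
threshold, and whether both conditions hold. It says nothing about Docker,
which is the remaining item in the Red Hat list.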

The Red Hat solution basically suggests updating to their later kernel.
What would one do with the ELRepo one?



-- 
Grigory Shamov

Westgrid/ComputeCanada Site Lead
University of Manitoba
E2-588 EITC Building,
(204) 474-9625





On 2016-10-06, 11:31 AM, "elrepo-boun...@lists.elrepo.org on behalf of
Grigory Shamov" <elrepo-boun...@lists.elrepo.org on behalf of
grigory.sha...@umanitoba.ca> wrote:

>Hi All,
>
>We are running kernel-lt-3.10.103 on about 300 CentOS 6.8 machines of our
>HPC cluster. 
>The machines are fairly old Intel Xeon X5650s (Westmere/Nehalem chips,
>SSE4.2).
>We first tested whether the kernel works with our driver stack, were
>satisfied, and went to production.
>
>It turned out, though, that under production load, from time to time, on
>some of the nodes (a few of them, seemingly at random), the kernel panics
>on nmi_watchdog hard lockups (and occasionally complains about soft
>lockups), emitting various messages like this:
>
>"""
>Kernel panic - not synching: Watchdog detected hard LOCKUP on cpu 3
> ... Call trace follows; mentions watchdog_overflow_callback ...
>Shutting down cpus with NMI
>drms_kms_helper: panic occurred, switching back to text console
>"""
>
>Then we tried simply increasing kernel.watchdog_thresh; on 3.10 it is set
>to 10, while on the CentOS 6 2.6.32 kernel it used to be 60.
>That made things worse: the test node quickly had a kernel panic with a
>call trace mentioning "perf_event_overflow".
>
>Is there anything we can do about these errors, and what would be the
>possible reason for them? Could anyone suggest a fix? Thank you very much
>in advance.  
>
>
>-- 
>Grigory Shamov
>
>Westgrid/ComputeCanada Site Lead
>University of Manitoba
>E2-588 EITC Building,
>(204) 474-9625
>
>
>
>
>
>_______________________________________________
>elrepo mailing list
>elrepo@lists.elrepo.org
>http://lists.elrepo.org/mailman/listinfo/elrepo

