Re: [elrepo] hard lockups on CPU's with elrepo kernel 3.10.103 on CentOS 6

2016-10-06 Thread Akemi Yagi
On Thu, Oct 6, 2016 at 3:27 PM, Akemi Yagi wrote:
> On Thu, Oct 6, 2016 at 2:22 PM, Grigory Shamov wrote:
>> An update:
>>
>> It looks like the same issue was observed in Red Hat 7 kernels, which are
>> also based on 3.10. It pertains to a perf_event_overflow error with an
>> increased kernel.watchdog_thresh:
>>
>> https://access.redhat.com/solutions/1354963
>>
>> ```
>> * Red Hat Enterprise Linux (RHEL) 7
>> * seen on several versions of the RHEL7 kernel (3.10.0-version.el7.x86_64)
>> * the /proc/sys/kernel/watchdog_thresh parameter is set to a higher value
>> than the default
>> * Docker
>> ```
>>
>> They report a panic under Docker; we see it under a normal application
>> workload (but HPC applications are long-running and use a lot of memory,
>> so they can be somewhat similar to a heavily used container).
>>
>> The Red Hat solution basically suggests updating to their later kernel.
>> What would one do with the ELRepo one?
>
> I'd like to track down the patch(es) Red Hat applied to fix the issue.
> It is possible that, while kernel-lt does not have the patch,
> kernel-ml may have it. At any rate the patch must be identified to
> find that out.

I now suspect the following patch was the one:

https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable/+/9809b18fcf6b8d8ec4d3643677345907e6b50eca

It first appeared in kernel 3.12. RH backported it to 7.1/7.2 kernels.
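
As a rough illustration of the version reasoning above, the following Python sketch compares a kernel release string against 3.12. The function names and the example release string are made up for illustration. It only looks at the mainline base version, so it cannot tell whether a distribution kernel (such as the RHEL 7.1/7.2 ones) carries the patch as a backport.

```python
import re


def mainline_version(release):
    """Parse (major, minor) from a release string such as '3.10.103-1.el6.elrepo.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def predates_fix(release, first_fixed=(3, 12)):
    """True if the mainline base version is older than the first release with the patch.

    Version numbers alone cannot detect distribution backports.
    """
    return mainline_version(release) < first_fixed
```

On a running system, `predates_fix(platform.release())` (after `import platform`) would apply the same check to the booted kernel.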

Akemi
___
elrepo mailing list
elrepo@lists.elrepo.org
http://lists.elrepo.org/mailman/listinfo/elrepo


Re: [elrepo] hard lockups on CPU's with elrepo kernel 3.10.103 on CentOS 6

2016-10-06 Thread Grigory Shamov
An update:

It looks like the same issue was observed in Red Hat 7 kernels, which are
also based on 3.10. It pertains to a perf_event_overflow error with an
increased kernel.watchdog_thresh:


https://access.redhat.com/solutions/1354963

```
* Red Hat Enterprise Linux (RHEL) 7
* seen on several versions of the RHEL7 kernel (3.10.0-version.el7.x86_64)
* the /proc/sys/kernel/watchdog_thresh parameter is set to a higher value
than the default
* Docker
```

They report a panic under Docker; we see it under a normal application
workload (but HPC applications are long-running and use a lot of memory,
so they can be somewhat similar to a heavily used container).

The Red Hat solution basically suggests updating to their later kernel.
What would one do with the ELRepo one?



-- 
Grigory Shamov

Westgrid/ComputeCanada Site Lead
University of Manitoba
E2-588 EITC Building,
(204) 474-9625





On 2016-10-06, 11:31 AM, "elrepo-boun...@lists.elrepo.org on behalf of
Grigory Shamov"  wrote:

>Hi All,
>
>We are running kernel-lt-3.10.103 on about 300 CentOS 6.8 machines in our
>HPC cluster.
>The machines are fairly old Intel Xeon X5650s (Westmere/Nehalem chips,
>SSE4.2).
>We first tested whether the kernel works with our driver stack, were
>satisfied, and went to production.
>
>It turned out, though, that under production load, from time to time, on
>some of the nodes (a few of them, seemingly at random), the kernel panics on
>nmi_watchdog hard lockups (and from time to time complains about soft
>lockups), emitting various messages like this:
>
>"""
>Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
> ... Call trace follows; mentions watchdog_overflow_callback ...
>Shutting down cpus with NMI
>drm_kms_helper: panic occurred, switching back to text console
>"""
>
>Then we tried simply increasing kernel.watchdog_thresh; on 3.10 it
>is set to 10, while on the CentOS 6 2.6.32 kernel it used to be 60.
>That made things worse: the test node quickly had a kernel panic with a
>call trace mentioning "perf_event_overflow".
>
>Is there anything we can do about these errors, and what would be the
>possible reason for them? Could anyone suggest a fix? Thank you very much
>in advance.  



[elrepo] hard lockups on CPU's with elrepo kernel 3.10.103 on CentOS 6

2016-10-06 Thread Grigory Shamov
Hi All,

We are running kernel-lt-3.10.103 on about 300 CentOS 6.8 machines in our
HPC cluster.
The machines are fairly old Intel Xeon X5650s (Westmere/Nehalem chips,
SSE4.2).
We first tested whether the kernel works with our driver stack, were
satisfied, and went to production.

It turned out, though, that under production load, from time to time, on
some of the nodes (a few of them, seemingly at random), the kernel panics on
nmi_watchdog hard lockups (and from time to time complains about soft
lockups), emitting various messages like this:

"""
Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
 ... Call trace follows; mentions watchdog_overflow_callback ...
Shutting down cpus with NMI
drm_kms_helper: panic occurred, switching back to text console
"""
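
To check whether the lockups really hit CPUs at random across a node's logs, a small Python sketch like the one below could tally the CPU numbers from messages of the form shown above. The function name is hypothetical, and the pattern assumes the message text exactly as quoted in this mail.

```python
import re
from collections import Counter

# Matches the panic line quoted above and captures the CPU number.
HARD_LOCKUP = re.compile(r"Watchdog detected hard LOCKUP on cpu (\d+)")


def count_hard_lockups(lines):
    """Return a Counter mapping CPU number to the number of hard-lockup lines seen."""
    counts = Counter()
    for line in lines:
        match = HARD_LOCKUP.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

It accepts any iterable of lines, for example an open file holding captured console output.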

Then we tried simply increasing kernel.watchdog_thresh; on 3.10 it
is set to 10, while on the CentOS 6 2.6.32 kernel it used to be 60.
That made things worse: the test node quickly had a kernel panic with a
call trace mentioning "perf_event_overflow".
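
For auditing which nodes run with a raised threshold, here is a minimal Python sketch that reads the sysctl through /proc. The helper names are made up for illustration, and the default of 10 is simply the value we see on 3.10, not something queried from the kernel.

```python
from pathlib import Path

# procfs view of the kernel.watchdog_thresh sysctl.
WATCHDOG_THRESH = Path("/proc/sys/kernel/watchdog_thresh")
# Default observed on the 3.10 kernel (an assumption taken from this mail).
DEFAULT_THRESH = 10


def parse_thresh(text):
    """Parse the sysctl file contents (a decimal number of seconds)."""
    return int(text.strip())


def is_raised(thresh, default=DEFAULT_THRESH):
    """True if the threshold is above the assumed default."""
    return thresh > default


def read_thresh(path=WATCHDOG_THRESH):
    """Return the current threshold, or None if the file cannot be read."""
    try:
        return parse_thresh(path.read_text())
    except OSError:
        return None
```

Calling `read_thresh()` on a node returns the live value, which can then be passed to `is_raised()`.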

Is there anything we can do about these errors, and what would be the
possible reason for them? Could anyone suggest a fix? Thank you very much
in advance.  


-- 
Grigory Shamov

Westgrid/ComputeCanada Site Lead
University of Manitoba
E2-588 EITC Building,
(204) 474-9625




