Re: [CentOS] C 7: smpboot: CPU 16 is now offline, and slabs...

2018-06-13 Thread m . roth
m.r...@5-cent.us wrote:
> m.r...@5-cent.us wrote:
>> m.r...@5-cent.us wrote:
>>> Current kernel, and I just booted, and dmesg shows that, of the 32 cores, 0,
>>> 2, 4 and 6 are OK, and *all* the others show "is now offline".
>>>
>>> What's happening here?
> 
> OK, more info. I found out how to bring a CPU online:
> echo 1 > /sys/devices/system/cpu/cpu23/online
>
> Perhaps I should have started with 1, 3, etc., but I was doing the 20s
> instead. I got to CPU27... and the system rebooted.
>
> Now I'm wondering whether the offlined CPUs have something to do with the fact
> that this machine and an identical one in the datacenter are both rebooting
> around 04:00 every day. By the way, they're Dell PowerEdge R530s from 2016.
>
Still more info (come on, folks, help me out!): the two machines that
keep rebooting, plus just one other that doesn't, have Intel E5-2630s in
them. These two are v3s, while the other one is a v2. The latter's
microcode is
microcode: CPU0 sig=0x306e4, pf=0x1, revision=0x428
while the two that reboot have
microcode: CPU0 sig=0x306f2, pf=0x1, revision=0x3a
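To compare those microcode lines side by side, here is a minimal Python sketch that parses a line in exactly the format quoted above and decodes the CPUID signature into family, model and stepping. It assumes the CentOS 7 message format shown here (other kernels word it differently); the function names are hypothetical.

```python
import re

# Matches the format quoted above, e.g.
# "microcode: CPU0 sig=0x306f2, pf=0x1, revision=0x3a"
MICROCODE_RE = re.compile(
    r"microcode: CPU(\d+) sig=(0x[0-9a-fA-F]+), "
    r"pf=(0x[0-9a-fA-F]+), revision=(0x[0-9a-fA-F]+)"
)


def parse_microcode(line):
    """Return (cpu, sig, pf, revision) as ints, or None if the line doesn't match."""
    m = MICROCODE_RE.search(line)
    if m is None:
        return None
    cpu, sig, pf, rev = m.groups()
    return int(cpu), int(sig, 16), int(pf, 16), int(rev, 16)


def decode_signature(sig):
    """Split an Intel CPUID signature into (family, model, stepping)."""
    stepping = sig & 0xF
    model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF
    family = base_family + ext_family if base_family == 0xF else base_family
    if base_family in (0x6, 0xF):
        # The extended model bits only apply to families 6 and 15.
        model += ext_model << 4
    return family, model, stepping


for line in ("microcode: CPU0 sig=0x306e4, pf=0x1, revision=0x428",
             "microcode: CPU0 sig=0x306f2, pf=0x1, revision=0x3a"):
    cpu, sig, pf, rev = parse_microcode(line)
    family, model, stepping = decode_signature(sig)
    print(f"CPU{cpu}: family {family}, model {model:#x}, "
          f"stepping {stepping}, revision {rev:#x}")
```

This prints model 0x3e for the first line and 0x3f for the second; those are Ivy Bridge-EP and Haswell-EP respectively, which matches the v2/v3 split described above.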

Does anyone think I might be going down the wrong path? Any thoughts at
all? If not, any comments on downgrading to the previous microcode? This
happened once about a week ago, and then, starting last Friday, it began
happening at least once every day, around 04:00.

mark



___
CentOS mailing list
CentOS@centos.org
https://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] C 7: smpboot: CPU 16 is now offline, and slabs...

2018-06-13 Thread m . roth
m.r...@5-cent.us wrote:
> m.r...@5-cent.us wrote:
>> Current kernel, and I just booted, and dmesg shows that, of the 32 cores, 0,
>> 2, 4 and 6 are OK, and *all* the others show "is now offline".
>>
>> What's happening here?

OK, more info. I found out how to bring a CPU online:
echo 1 > /sys/devices/system/cpu/cpu23/online

Perhaps I should have started with 1, 3, etc., but I was doing the 20s
instead. I got to CPU27... and the system rebooted.
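Before onlining more CPUs one at a time, it may be safer to just list what the kernel reports. /sys/devices/system/cpu/online and /sys/devices/system/cpu/offline are standard sysfs files holding a "cpulist" such as 0,2,4,6 or 1,3,5,7-31. A minimal, read-only Python sketch that expands them, assuming plain comma-separated numbers and ranges; parse_cpulist is a hypothetical helper name:

```python
from pathlib import Path

SYSFS_CPU = Path("/sys/devices/system/cpu")


def parse_cpulist(text):
    """Expand a kernel cpulist such as "1,3,5,7-31" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # the offline file holds just a newline when nothing is offline
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


if __name__ == "__main__":
    # Only reads sysfs; it does not change the state of any CPU.
    for name in ("online", "offline"):
        path = SYSFS_CPU / name
        if path.exists():
            print(f"{name}: {parse_cpulist(path.read_text())}")
```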

Now I'm wondering whether the offlined CPUs have something to do with the fact
that this machine and an identical one in the datacenter are both rebooting
around 04:00 every day. By the way, they're Dell PowerEdge R530s from 2016.

  mark



Re: [CentOS] C 7: smpboot: CPU 16 is now offline, and slabs...

2018-06-13 Thread m . roth
m.r...@5-cent.us wrote:
> Current kernel, and I just booted, and dmesg shows that, of the 32 cores, 0, 2,
> 4 and 6 are OK, and *all* the others show "is now offline".
>
> What's happening here?
>
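To see exactly which CPUs the kernel took offline, the "smpboot: CPU N is now offline" messages (as in the subject line) can be pulled out of saved dmesg output. A minimal Python sketch, assuming the message text matches the subject line; cpus_taken_offline is a hypothetical name and the sample input is made up:

```python
import re

# Kernel message as quoted in the subject line: "smpboot: CPU 16 is now offline"
OFFLINE_RE = re.compile(r"smpboot: CPU (\d+) is now offline")


def cpus_taken_offline(dmesg_text):
    """Return the sorted, de-duplicated CPU numbers named in offline messages."""
    return sorted({int(n) for n in OFFLINE_RE.findall(dmesg_text)})


# Made-up sample input, not real output from these machines.
sample = """\
smpboot: CPU 16 is now offline
smpboot: CPU 1 is now offline
smpboot: CPU 16 is now offline
"""
print(cpus_taken_offline(sample))  # [1, 16]
```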
A follow-up: I also found a core dump in /var/spool/abrt, and its "reason" is
 kernel BUG at mm/slub.c:3601!

Googling turns up threads about incorrect calculation of slabs. Following
one of them, I ran
cat /sys/kernel/slab/:t-048/cpu_slabs

which gives me

4 N0=4
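On a NUMA-enabled kernel, SLUB sysfs counters such as cpu_slabs print a total followed by per-node "N<node>=<count>" fields, so 4 N0=4 reads as 4 in total, all on node 0. A small Python sketch that splits that format, assuming only the total and N fields matter (anything else is ignored); parse_slub_counter is a hypothetical name:

```python
def parse_slub_counter(text):
    """Parse a SLUB sysfs counter like "4 N0=4" into (total, {node: count})."""
    fields = text.split()
    total = int(fields[0])
    per_node = {}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        # Keep only per-node entries such as "N0=4".
        if key.startswith("N") and key[1:].isdigit():
            per_node[int(key[1:])] = int(value)
    return total, per_node


print(parse_slub_counter("4 N0=4\n"))  # (4, {0: 4})
```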

Meanwhile, slabtop shows
 Active / Total Slabs (% used)  : 25927 / 25927 (100.0%)

It changes, but only varies around that number, and stays at 100%.
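The percentage in that slabtop line is simply active divided by total. A minimal Python sketch that re-derives it from the header line, assuming slabtop's wording exactly as pasted above; slab_usage is a hypothetical name. One caveat worth checking: as far as I can tell from mm/slub.c, SLUB reports the same value for active and total slabs in /proc/slabinfo (which slabtop reads), so this ratio may sit at 100% regardless of memory pressure.

```python
import re

# Matches slabtop's header, e.g.
# " Active / Total Slabs (% used)  : 25927 / 25927 (100.0%)"
SLABS_RE = re.compile(r"Active / Total Slabs \(% used\)\s*:\s*(\d+) / (\d+)")


def slab_usage(line):
    """Return (active, total, percent_used) from slabtop's Slabs header line."""
    m = SLABS_RE.search(line)
    if m is None:
        raise ValueError("not a slabtop 'Active / Total Slabs' line")
    active, total = int(m.group(1)), int(m.group(2))
    percent = 100.0 * active / total if total else 0.0
    return active, total, percent


print(slab_usage(" Active / Total Slabs (% used)  : 25927 / 25927 (100.0%)"))
# (25927, 25927, 100.0)
```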

So: should I increase the number of slabs using the swiotlb kernel
parameter, and if so, given what I show above, should I set it to, say, 32000?

   mark
