We were running the Wazuh 2.8.1 agent on "most" systems, with the Wazuh OSSEC
Docker container as the master server.
We upgraded to 2.8.3 to try to resolve this problem, with no luck.
Out of about 160 machines, 4-5 of them will reliably wedge themselves after
some amount of time with messages akin to:
2017 Feb 28 15:35:34 <server> NMI watchdog: BUG: soft lockup - CPU#0 stuck
for 22s! [ossec-syscheckd:12608]
2017 Feb 28 15:36:02 <server> NMI watchdog: BUG: soft lockup - CPU#0 stuck
for 22s! [ossec-syscheckd:12608]
2017 Feb 28 15:36:34 <server> NMI watchdog: BUG: soft lockup - CPU#0 stuck
for 23s! [ossec-syscheckd:12608]
If this continues long enough, the entire system grinds to a halt, and
requires Big Red Button service.
I finally managed to attach an strace to one today, but I may not have
gotten it right.
# strace -e trace=read,write -p 12608
which displayed an awful lot of noise (I'd just done a clean reinstall of
ossec-hids-agent) of the form:
read(7, "ST6=m\nCONFIG_CRYPTO_CAST6_AVX_X8"..., 1024) = 1024
read(7, "TO_DEV_QAT=m\nCONFIG_CRYPTO_DEV_Q"..., 1024) = 1024
read(7, "G_PERCPU_RWSEM=y\nCONFIG_ARCH_USE"..., 1024) = 1024
read(7, "NFIG_TEXTSEARCH_BM=m\nCONFIG_TEXT"..., 1024) = 479
read(7, "", 1024) = 0
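(For what it's worth, if anyone else catches one of these while the process
is still running: the fd number in the strace output can be mapped back to a
pathname through procfs. A sketch, using the syscheckd PID from the lockup
messages above -- substitute whatever PID is wedged on your system:)

```shell
# Map fd 7 from the strace output back to the file being read.
# /proc/<pid>/fd/<n> is a symlink to the open file; -l shows the target.
# 12608 is the ossec-syscheckd PID from the log above (an assumption on
# any other system -- use the wedged PID there).
ls -l /proc/12608/fd/7
```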
before going into this loop:
SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13344,
si_status=0,si_utime=0, si_stime=0}
SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13346,
si_status=0,si_utime=0, si_stime=0}
SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13355,
si_status=0,si_utime=0, si_stime=0}
SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13357,
si_status=0,si_utime=0, si_stime=0}
SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13359,
si_status=0,si_utime=0, si_stime=0}
SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13361,
si_status=0,si_utime=0, si_stime=0}
SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13363,
si_status=0,si_utime=0, si_stime=0}
.......
SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13660,
si_status=0,si_utime=0, si_stime=0}
and then the soft lockup messages started -- and no, I didn't think to
attach an strace to pid 13660 until after I'd rebooted.
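If it happens again, I'll try following forks from the start, so those
short-lived children get traced automatically instead of me chasing PIDs by
hand. Something like this (an untested sketch -- 12608 is the parent
ossec-syscheckd PID from the log above):

```shell
# -f  follows child processes as they are forked, so each child
#     syscheckd spawns is traced without attaching to it manually
# -tt adds microsecond timestamps to each line
# -o  writes the trace to a file instead of flooding the terminal
strace -f -tt -e trace=read,write -o /tmp/syscheckd.trace -p 12608
```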
It's a production server, and while it's not heavily used, it's used enough
that we don't want it off during production hours.
Some information about the server:
kernelrelease => 3.10.0-514.6.1.el7.x86_64
lsbdistdescription => Red Hat Enterprise Linux Server release 7.3
(Maipo)
It's a VM under VMware esx, 2 cores, 2 gig memory, ext4 / LVM. All of the
affected systems appear to be Red Hat 7, all patched within the last 30
days.
Any suggestions where to look next?
Thanks in advance!
--John
--
---
You received this message because you are subscribed to the Google Groups
"ossec-list" group.