We were running the Wazuh 2.8.1 agent on most systems, with the Wazuh OSSEC
Docker container as the master server.

We upgraded to 2.8.3 to try to resolve this problem, with no luck.

Out of about 160 machines, 4-5 of them will reliably wedge themselves after 
some amount of time with messages akin to:

2017 Feb 28 15:35:34 <server> NMI watchdog: BUG: soft lockup - CPU#0 stuck 
for 22s! [ossec-syscheckd:12608]
2017 Feb 28 15:36:02 <server> NMI watchdog: BUG: soft lockup - CPU#0 stuck 
for 22s! [ossec-syscheckd:12608]
2017 Feb 28 15:36:34 <server> NMI watchdog: BUG: soft lockup - CPU#0 stuck 
for 23s! [ossec-syscheckd:12608]

If this continues long enough, the entire system grinds to a halt, and 
requires Big Red Button service.

I finally managed to attach an strace to one today, but I may not have 
gotten it right.

# strace -e trace=read,write -p 12608

which displayed an awful lot of noise of this format (I'd just done a clean
reinstall of ossec-hids-agent):

read(7, "ST6=m\nCONFIG_CRYPTO_CAST6_AVX_X8"..., 1024) = 1024
read(7, "TO_DEV_QAT=m\nCONFIG_CRYPTO_DEV_Q"..., 1024) = 1024
read(7, "G_PERCPU_RWSEM=y\nCONFIG_ARCH_USE"..., 1024) = 1024
read(7, "NFIG_TEXTSEARCH_BM=m\nCONFIG_TEXT"..., 1024) = 479
read(7, "", 1024)                       = 0

before going into this loop:

 SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13344, si_status=0, si_utime=0, si_stime=0}
 SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13346, si_status=0, si_utime=0, si_stime=0}
 SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13355, si_status=0, si_utime=0, si_stime=0}
 SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13357, si_status=0, si_utime=0, si_stime=0}
 SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13359, si_status=0, si_utime=0, si_stime=0}
 SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13361, si_status=0, si_utime=0, si_stime=0}
 SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13363, si_status=0, si_utime=0, si_stime=0}
.......
 SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13660, si_status=0, si_utime=0, si_stime=0}

and then the soft lockup messages started. And no, I didn't think to attach
an strace to pid 13660 until after I'd rebooted.
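Next time it happens I plan to re-attach with fork-following, so those
short-lived children behind the SIGCHLD storm get traced too. A rough sketch
of what I have in mind (the pgrep selection is my guess at picking out the
right process; needs root on the real box):

```shell
#!/bin/sh
# Sketch: re-attach strace with -f so forked children are traced too.
# Bail out quietly if ossec-syscheckd isn't running on this machine.
PID=$(pgrep -o ossec-syscheckd) || exit 0   # oldest matching process
strace -f -tt -e trace=read,write,execve \
       -p "$PID" -o /tmp/syscheckd.strace   # child pids are prefixed in the one output file
```

With -f the children's syscalls land in the same output file, each line
prefixed with its pid, so whatever pid 13660's siblings were doing should
be visible.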

It's a production server, and while it's not heavily used, it's used enough
that we don't want it down during production hours.
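Also, for anyone who hits the same thing: what fd 7 (from the read() calls
above) points at can be read out of /proc while the process is still wedged.
A minimal sketch, demonstrated against the current shell so it can be tried
anywhere; in the real case the pair would be pid 12608, fd 7:

```shell
#!/bin/sh
# Sketch: resolve what a file descriptor points at via /proc.
# Demo uses a throwaway fd in the current shell; for the wedged
# daemon above it would be: readlink /proc/12608/fd/7
exec 3</dev/null                 # open a known fd for the demo
readlink "/proc/$$/fd/3"         # prints /dev/null here
```

No strace needed for that part, and it works even when the process is
unresponsive.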

Some information about the server:

      kernelrelease => 3.10.0-514.6.1.el7.x86_64
      lsbdistdescription => Red Hat Enterprise Linux Server release 7.3 (Maipo)

It's a VM under VMware ESX with 2 cores, 2 GB of memory, and ext4 on LVM.
All of the affected systems appear to be Red Hat 7, all patched within the
last 30 days.

Any suggestions where to look next?

Thanks in advance!

--John
