Just to add to the numerous "agent disconnection" issues reported with
OSSEC.
I have approximately 250 CentOS agents that talk to a server. All of them
connect without issue when both sides are restarted, but after a few hours
they begin to drop off, and when I come in the next morning a random
number will have disconnected (more than just a couple, and not necessarily
the same ones each time). If I restart the processes, they ALL eventually
reconnect, without exception.
I have run tcpdump on both sides and verified that traffic flows in both
directions.
In most cases, it appears that the client sends a message and the server
receives it but never responds.
I don't see anything in /var/ossec/logs/ossec.log and I've used strace on
the remoted and monitord processes, but thus far have not been able to
narrow down the issue.
In one instance, I watched in strace three successive failures from the
agent's perspective, followed by the agent connecting successfully. There
are differences between the traces, but I can't see their relevance.
Am I looking in the right place?
Here's an example of a failed client request from the server-side trace
(abbreviated):
1. recvfrom(4, ...) - I believe fd 4 is the network socket
2. stat("/queue/ossec/.wait...) -1 ENOENT ...
3. sendto(5, "1:(hostname.foo.bar.com) 1"... - I believe fd 5 is the remoted
process?
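One way to confirm the descriptor guesses above, rather than inferring them from the trace, is to read the symlinks under /proc/<pid>/fd on the server. This is a minimal Python sketch, assuming a Linux /proc filesystem and enough privileges (typically root) to inspect the remoted process. `describe_fds` is a helper name made up for illustration.

```python
import os
import sys


def describe_fds(pid):
    """Return {fd: target} for each open descriptor of a process.

    Assumes Linux: every entry in /proc/<pid>/fd is a symlink to the
    file, socket, or pipe behind that descriptor number.
    """
    fd_dir = f"/proc/{pid}/fd"
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets


if __name__ == "__main__":
    # Pass the PID of the traced process; defaults to this script's own PID.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    for fd, target in sorted(describe_fds(pid).items()):
        print(fd, target)
```

Run against the PID being traced, this should show whether fd 4 and fd 5 are sockets (printed as `socket:[inode]`) or regular files.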
Here's what the successful attempt appears to look like:
1. recvfrom(4,...)
2. time(NULL)
3. time(NULL)
4. brk(some hex)
5. brk(some hex)
6. brk(some hex)
7. brk(some different hex)
8. write(222, "48:3720:", 8) - my guess is this is either
/queue/rids/<agent_key> or /queue/rids/sender_counter
9. lseek(222, 0, SEEK_SET)
10. sendto(4,...)
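To make the differences between the two traces easier to see, each call can be reduced to a syscall name plus its leading file descriptor and then compared. The sketch below is illustrative only: `call_signature` and `compare` are hypothetical helpers, the regular expression assumes plain strace output, and the sample lines are the abbreviated ones listed above, not a full capture.

```python
import re

# First "name(" on the line, plus the first argument when it is a bare
# integer (a file descriptor). Hex arguments such as brk(0x...) are ignored.
CALL_RE = re.compile(r"([a-z_][a-z0-9_]*)\((?:(\d+)\s*[,)])?")


def call_signature(line):
    """Return 'name' or 'name(fd)' for one strace line, or None if no call."""
    match = CALL_RE.search(line)
    if match is None:
        return None
    name, fd = match.groups()
    return f"{name}({fd})" if fd is not None else name


def compare(failed_lines, ok_lines):
    """Return the signatures seen only in the failed trace and only in the good one."""
    failed = [s for s in map(call_signature, failed_lines) if s]
    ok = [s for s in map(call_signature, ok_lines) if s]
    only_failed = [s for s in dict.fromkeys(failed) if s not in ok]
    only_ok = [s for s in dict.fromkeys(ok) if s not in failed]
    return only_failed, only_ok


# Abbreviated traces copied from this post (elisions kept as-is).
FAILED = [
    "recvfrom(4, ...)",
    'stat("/queue/ossec/.wait...) -1 ENOENT ...',
    'sendto(5, "1:(hostname.foo.bar.com) 1"...',
]
OK = [
    "recvfrom(4,...)", "time(NULL)", "time(NULL)",
    "brk(some hex)", "brk(some hex)", "brk(some hex)", "brk(some different hex)",
    'write(222, "48:3720:", 8)', "lseek(222,0, SEEK_SET)", "sendto(4,...)",
]

if __name__ == "__main__":
    only_failed, only_ok = compare(FAILED, OK)
    print("only in failed trace:", only_failed)
    print("only in successful trace:", only_ok)
```

On these abbreviated traces, the comparison reports stat and sendto(5) only in the failed attempt, and time, brk, write(222), lseek(222) and sendto(4) only in the successful one. In other words, the failed attempt as captured never issues a sendto on fd 4.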
I have already disabled the replay protection check, because I suspected
it was the culprit.
Any ideas?