Just to add to the numerous "agent disconnection" issues reported with 
OSSEC.

I have approximately 250 CentOS agents that report to a server. All of them 
connect without issue when both sides are restarted, but after a few hours 
they begin to drop off. By the next morning a random number of them have 
disconnected (more than just a couple, and not necessarily the same ones 
each time). If I restart the processes, they ALL eventually reconnect, 
without exception.

I have run tcpdump on both sides and verified communication exists in both 
directions.

In most cases, it appears that the client sends a message and the server 
receives it but never responds.

I don't see anything in /var/ossec/logs/ossec.log and I've used strace on 
the remoted and monitord processes, but thus far have not been able to 
narrow down the issue.

In one instance, I watched three successive failures from the agent's 
perspective in strace, followed by the agent connecting. There are 
differences between the failed and successful attempts, but I can't see 
their relevance.

Am I looking in the right place?

Here's an abbreviated server-side trace of a failed client request:

1. recvfrom(4, ...) - I believe fd 4 is the network socket
2. stat("/queue/ossec/.wait...) -1 ENOENT ...
3. sendto(5, "1:(hostname.foo.bar.com) 1"... - I believe fd 5 is the 
remoted process?
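
The fd guesses above can be checked directly instead of inferred from the 
trace. On Linux, every entry under /proc/<pid>/fd is a symlink to whatever 
that descriptor refers to: sockets show up as "socket:[inode]" and regular 
files as their path. Below is a minimal Python sketch, assuming it is run 
as root (or as the process owner) against the PID of the traced process. 
The names describe_fds and proc_root are made up for illustration and are 
not part of OSSEC.

```python
import os
import sys


def describe_fds(pid, fds, proc_root="/proc"):
    """Map each fd number to the target of its /proc/<pid>/fd/<n> symlink."""
    targets = {}
    for fd in fds:
        link = os.path.join(proc_root, str(pid), "fd", str(fd))
        try:
            # Sockets read back as "socket:[inode]", files as their path.
            targets[fd] = os.readlink(link)
        except OSError as exc:
            # fd already closed, process gone, or insufficient permissions.
            targets[fd] = "<unavailable: %s>" % exc.strerror
    return targets


# Usage (hypothetical script name): python fdcheck.py <pid> <fd> [<fd> ...]
if __name__ == "__main__" and len(sys.argv) > 2:
    pid = int(sys.argv[1])
    wanted = [int(arg) for arg in sys.argv[2:]]
    for fd, target in describe_fds(pid, wanted).items():
        print("fd %d -> %s" % (fd, target))
```

The same check works for any other descriptor number that shows up in a 
trace.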

Here's what the successful attempt appears to look like:

1. recvfrom(4, ...)
2. time(NULL)
3. time(NULL)
4. brk(some hex)
5. brk(some hex)
6. brk(some hex)
7. brk(some different hex)
8. write(222, "48:3720:", 8) - my guess is this is either 
/queue/rids/<agent_key> or /queue/rids/sender_counter
9. lseek(222, 0, SEEK_SET)
10. sendto(4, ...)
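
To make the difference between the two traces easier to see, the syscall 
names can be pulled out of each one and diffed. This is only a rough 
sketch using Python's standard difflib. The function names are made up for 
illustration, and the sample input is the abbreviated traces from this 
message, not full strace output. It compares syscall names only, so 
argument differences (such as sendto going to fd 5 in one case and fd 4 in 
the other) still have to be checked by hand.

```python
import difflib
import re

# First "name(" token on a line; tolerates list numbers, PID or timestamp prefixes.
SYSCALL_RE = re.compile(r"([a-z_][a-z0-9_]*)\(")


def syscall_names(trace_lines):
    """Return the syscall name from each trace line that contains one."""
    names = []
    for line in trace_lines:
        match = SYSCALL_RE.search(line)
        if match:
            names.append(match.group(1))
    return names


def compare_traces(failed_lines, ok_lines):
    """Diff two traces by syscall name; '-' is failed-only, '+' is ok-only."""
    diff = difflib.ndiff(syscall_names(failed_lines), syscall_names(ok_lines))
    # Drop ndiff's intraline hint lines, which start with "? ".
    return [entry for entry in diff if not entry.startswith("? ")]


failed = [
    "recvfrom(4, ...)",
    'stat("/queue/ossec/.wait...)',
    'sendto(5, "1:(hostname.foo.bar.com) 1"...',
]
succeeded = [
    "recvfrom(4, ...)",
    "time(NULL)",
    "time(NULL)",
    "brk(...)",
    "brk(...)",
    "brk(...)",
    "brk(...)",
    'write(222, "48:3720:", 8)',
    "lseek(222, 0, SEEK_SET)",
    "sendto(4, ...)",
]

for entry in compare_traces(failed, succeeded):
    print(entry)
```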

I have disabled the replay protection check, because I suspected it was 
the culprit.

Any ideas?
