Which versions of OSSEC were running on the server and on the agents?

On Monday, January 14, 2013 4:51:38 PM UTC-8, Tony Trummer wrote:
>
> Just to add to the numerous "agent disconnection" issues reported with
> OSSEC.
>
> I have approximately 250 CentOS agents that speak to a server. All of
> them connect without issue if both sides are restarted, but after a few
> hours they begin to drop off, and when I come in the next morning a random
> number will have disconnected (more than just a couple, and not
> necessarily the same ones). If I restart the processes, they will ALL
> eventually reconnect, without exception.
>
> I have run tcpdump on both sides and verified communication exists in both
> directions.
>
> In most cases, it appears that the client sends a message and the server
> receives it, but never responds.
>
> I don't see anything in /var/ossec/logs/ossec.log and I've used strace on
> the remoted and monitord processes, but thus far have not been able to
> narrow down the issue.
>
> In one instance, I watched three successive failures from the agent's
> perspective in strace, followed by the agent connecting. There are
> differences between the traces, but I can't see their relevance.
>
> Am I looking in the right place?
>
> Here's an example of a failed client request in the server-side trace
> (abbreviated).
>
> 1. recvfrom(4, ...) - I believe fd 4 is the network socket
> 2. stat("/queue/ossec/.wait...) -1 ENOENT ...
> 3. sendto(5, "1:(hostname.foo.bar.com) 1"... - I believe fd 5 is the
> remoted process?
>
> Here's what the successful attempt appears to look like:
> 1. recvfrom(4,...)
> 2. time(NULL)
> 3. time(NULL)
> 4. brk(some hex)
> 5. brk(some hex)
> 6. brk(some hex)
> 7. brk(some different hex)
> 8. write(222, "48:3720:", 8) - my guess is this is either
> /queue/rids/<agent_key> or /queue/rids/sender_counter
> 9. lseek(222, 0, SEEK_SET)
> 10. sendto(4, ...)
>
> The replay protection check has been disabled, because I suspected it was
> the culprit.
>
> Any ideas?
>
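
To compare the failed and successful sequences across a longer capture, it may help to group the strace output by incoming request instead of reading it by eye. The following is only a rough sketch, not anything that ships with OSSEC. It assumes plain strace output with one syscall per line (an optional leading PID, no timestamps) and takes the guess from the trace above that fd 4 is the network socket. The function name `group_requests` and the sample lines are made up for illustration; the sample simply mirrors the abbreviated calls quoted above.

```python
import re

# Assumption from the quoted trace: fd 4 is remoted's network socket.
NETWORK_FD = 4

# Matches "name(" with an optional leading PID, and captures the first
# argument only when it is a bare integer (i.e. looks like a descriptor).
SYSCALL_RE = re.compile(r"^\s*(?:\d+\s+)?(?P<name>\w+)\((?:(?P<fd>\d+)[,)])?")


def group_requests(lines, network_fd=NETWORK_FD):
    """Group syscalls by incoming datagram.

    A new group starts at each recvfrom() on network_fd. Returns a list of
    (calls, replied) tuples, where calls is a list of (name, fd) pairs and
    replied is True if the group contains a sendto() on network_fd.
    """
    groups = []
    current = None
    for line in lines:
        match = SYSCALL_RE.match(line)
        if not match:
            continue  # skip anything that does not look like a syscall
        name = match.group("name")
        fd = int(match.group("fd")) if match.group("fd") else None
        if name == "recvfrom" and fd == network_fd:
            if current is not None:
                groups.append(current)
            current = []
        if current is not None:
            current.append((name, fd))
    if current is not None:
        groups.append(current)
    return [
        (calls, any(n == "sendto" and f == network_fd for n, f in calls))
        for calls in groups
    ]


# Illustrative input shaped like the abbreviated traces quoted above.
SAMPLE = [
    'recvfrom(4, ...)',
    'stat("/queue/ossec/.wait...) -1 ENOENT ...',
    'sendto(5, "1:(hostname.foo.bar.com) 1"...',
    'recvfrom(4,...)',
    'time(NULL)',
    'brk(some hex)',
    'write(222, "48:3720:", 8)',
    'lseek(222,0, SEEK_SET)',
    'sendto(4,...)',
]

if __name__ == "__main__":
    for number, (calls, replied) in enumerate(group_requests(SAMPLE), start=1):
        status = "replied" if replied else "NO REPLY"
        print(number, status, [name for name, _ in calls])
```

Run against a real capture (for example, the lines of a file written by strace on remoted), the groups marked NO REPLY are the ones worth diffing against a replied group, to see which call appears in place of the sendto() on the network socket.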