I’m having an odd issue with one particular server at one of our customers. We 
have Icinga2 set up in the "command execution bridge" scenario: no hosts or 
services are configured on the satellites. Instead, hosts and services are 
configured only on the central "master" node, which uses command_endpoint to 
execute the remote checks.

The satellite icinga2 instances are predominantly Windows Server 2008 R2, just 
like this one, and they all work fine (including other machines on the same 
site!). The exception is this one, where the cluster connection fails and is 
never re-established. The master instance is running Ubuntu Linux 14.04.

The infuriating thing is that there's *nothing* useful in the log files to go 
on. On the master side, everything works until it suddenly doesn't, with no 
intervening errors. I see checks being sent successfully and results being 
received. I also see events like this every 10 seconds, then suddenly they just 
stop coming:

[2015-11-02 09:22:48 +0100] notice/ApiClient: Received 'event::Heartbeat' message from 'srv03.example.com'

And then after a bit over a minute:

[2015-11-02 09:23:59 +0100] information/ApiClient: No messages for identity 'srv03.example.com' have been received in the last 60 seconds.
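
To pin down exactly when the heartbeats stop, the master's log could be scanned for gaps between consecutive heartbeat lines. The following is only a minimal Python sketch, not something I have running: the function name, the 30-second threshold, and the sample lines are made up for illustration. It assumes each log entry sits on a single line in the file, in the format quoted above, and that the UTC offset does not change within the scanned range.

```python
import re
from datetime import datetime, timedelta

# Matches heartbeat notices in the format quoted above. The UTC offset is
# matched but ignored (assumption: it stays constant within one log).
HEARTBEAT_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) [+-]\d{4}\] "
    r"notice/ApiClient: Received 'event::Heartbeat' message from '([^']+)'"
)


def find_heartbeat_gaps(lines, endpoint, max_gap=timedelta(seconds=30)):
    """Return (gaps, last_seen) for heartbeats received from `endpoint`.

    gaps is a list of (previous, current) timestamps that are more than
    max_gap apart; last_seen is the newest heartbeat found, or None.
    The 30 second default is an arbitrary choice, three times the
    10 second interval seen in the log.
    """
    gaps = []
    previous = None
    for line in lines:
        match = HEARTBEAT_RE.match(line)
        if match is None or match.group(2) != endpoint:
            continue
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if previous is not None and current - previous > max_gap:
            gaps.append((previous, current))
        previous = current
    return gaps, previous


# Illustrative input only: made-up lines in the same format as the excerpt.
sample = [
    "[2015-11-02 09:22:28 +0100] notice/ApiClient: Received 'event::Heartbeat' message from 'srv03.example.com'",
    "[2015-11-02 09:22:38 +0100] notice/ApiClient: Received 'event::Heartbeat' message from 'srv03.example.com'",
    "[2015-11-02 09:22:48 +0100] notice/ApiClient: Received 'event::Heartbeat' message from 'srv03.example.com'",
    "[2015-11-02 09:31:02 +0100] notice/ApiClient: Received 'event::Heartbeat' message from 'srv03.example.com'",
]

gaps, last_seen = find_heartbeat_gaps(sample, "srv03.example.com")
for before, after in gaps:
    print("no heartbeat between", before, "and", after)
print("last heartbeat seen:", last_seen)
```

Passing an open log file instead of the sample list should work the same way, since the function only iterates over lines. The gap boundaries could then be compared against other logs (firewall, Windows event log) from the same time.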

The log files on the satellite side are equally unhelpful. All I can see is:

[2015-11-02 09:23:50 Västeuropa, normaltid] information/ApiClient: No messages for identity 'icinga.example.com' have been received in the last 60 seconds.
[2015-11-02 09:23:50 Västeuropa, normaltid] warning/ApiClient: API client disconnected for identity 'icinga.example.com'
[2015-11-02 09:23:50 Västeuropa, normaltid] warning/ApiListener: Removing API client for endpoint 'icinga.example.com'. 0 API clients left.
[2015-11-02 09:23:55 Västeuropa, normaltid] information/ApiClient: Reconnecting to API endpoint 'icinga.example.com' via host '192.0.2.237' and port '5665'

After that it never appears to actually reconnect, and no failures or retries 
are logged.
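
Since the reconnect attempt logs nothing further, one thing that could be checked from the satellite is whether a plain TCP connection to the master's address and port can still be opened while the cluster link is down. This is a rough Python sketch under my own assumptions: probe_tcp is a made-up helper, it tests TCP reachability only (no TLS handshake, no Icinga protocol), and it presumes a Python 3 interpreter is available on the Windows machine.

```python
import socket
import time


def probe_tcp(host, port, timeout=5.0):
    """Attempt one TCP connection; return (ok, seconds_elapsed, error_text).

    Only checks that the three-way handshake completes. It does not speak
    TLS or the cluster protocol, so a success here does not prove that
    icinga2 itself could connect.
    """
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        return False, time.monotonic() - start, str(exc)
```

Calling probe_tcp("192.0.2.237", 5665) while the connection is in the failed state would at least show whether the problem sits at the network level (timeout or refusal) or above it (TCP connects, yet icinga2 stays disconnected).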

The failure occurs intermittently: once it happened as little as 10 minutes 
after a restart, other times it takes hours.

I'm running Icinga 2.3.11 on both the satellite and master.

Any insight into this problem, which right now looks like a black box to me, 
or at least ideas about what I could look at, would be appreciated.

-- 
Per von Zweigbergk
IT-assistans Sverige AB