Hello,
I'm currently evaluating a setup for a customer who has very specific needs in
terms of reliability and security. While setting up a minimal test environment
for a pilot installation I ran into an interesting problem - maybe someone has
seen it before and can give me a hint.
The basic setup consists of two zones, one called 'master' and one called
'satellite'. Each zone contains two endpoints:
object Zone "global" {
global = true
}
object Endpoint "icinga2-master1.vm.hindenburgring.com" {
host = "icinga2-master1.vm.hindenburgring.com"
}
object Endpoint "icinga2-master2.vm.hindenburgring.com" {
host = "icinga2-master2.vm.hindenburgring.com"
}
object Endpoint "icinga2-satellite1.vm.hindenburgring.com" {
host = "icinga2-satellite1.vm.hindenburgring.com"
}
object Endpoint "icinga2-satellite2.vm.hindenburgring.com" {
host = "icinga2-satellite2.vm.hindenburgring.com"
}
object Zone "master" {
endpoints = [ "icinga2-master1.vm.hindenburgring.com",
"icinga2-master2.vm.hindenburgring.com" ]
}
object Zone "satellite" {
parent = "master"
endpoints = [ "icinga2-satellite1.vm.hindenburgring.com",
"icinga2-satellite2.vm.hindenburgring.com" ]
}
In the final setup there will be seven satellite zones and the zones will also
get some command execution nodes, but the problem is reproducible with this
minimal setup.
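For context, the command execution nodes will later be attached as child zones
below their satellite zone (assuming the usual one-zone-per-client layout). A
minimal sketch of one such definition - the 'client1' name is only a
placeholder:

object Endpoint "client1.vm.hindenburgring.com" {
  host = "client1.vm.hindenburgring.com"
}

object Zone "client1.vm.hindenburgring.com" {
  # child zone of a satellite zone; checks are executed on this node
  # via command_endpoint
  parent = "satellite"
  endpoints = [ "client1.vm.hindenburgring.com" ]
}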
Perfdata storage for the environment uses Graphite on a separate cluster,
behind HAProxy and two carbon-relay instances for data replication. This setup
has already been tested and works reliably. Writing perfdata from all Icinga 2
nodes involved also works very reliably; all data are written without any
issues. So far it's perfect (and exactly what the customer requires).
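The graphite feature is enabled on every Icinga 2 node and points at the
HAProxy frontend of that cluster, roughly like this - the host below is only a
placeholder for the actual load balancer address:

object GraphiteWriter "graphite" {
  # all nodes write to the same HAProxy frontend, which hands the metrics
  # to the two carbon-relay instances
  host = "graphite-lb.vm.hindenburgring.com"
  port = 2003
}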
The problem starts when I bring up the second satellite endpoint. Load on the
Graphite cluster increases massively, and the number of updates per second
jumps from about 6-7 to something in the range of 3000(!), with fewer than 350
Whisper files in Graphite. When I enable update logging on Graphite I can see
that the vast majority of the updates come from one of the satellite endpoints:
> [...]
> 06/12/2015 15:35:08 :: wrote 441 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_services_ok.value
> in 0.00660 seconds
> 06/12/2015 15:35:08 :: wrote 442 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.min_execution_time.value
> in 0.00105 seconds
> 06/12/2015 15:35:08 :: wrote 442 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_hosts_flapping.value
> in 0.00077 seconds
> 06/12/2015 15:35:08 :: wrote 442 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_hosts_up.value
> in 0.00419 seconds
> 06/12/2015 15:35:08 :: wrote 443 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_services_unknown.value
> in 0.00075 seconds
> 06/12/2015 15:35:08 :: wrote 443 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.avg_latency.value
> in 0.00070 seconds
> 06/12/2015 15:35:08 :: wrote 443 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.max_execution_time.value
> in 0.00063 seconds
> 06/12/2015 15:35:08 :: wrote 443 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_services_pending.value
> in 0.00075 seconds
> 06/12/2015 15:35:08 :: wrote 443 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_services_warning.value
> in 0.00066 seconds
> [...]
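For reference, the update logging quoted above is carbon-cache's standard
option in carbon.conf (stock layout assumed):

[cache]
LOG_UPDATES = True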
The extreme load drops to normal values immediately once I shut down either of
the satellites - it doesn't have to be the one that's doing the extreme amount
of perfdata writing.
The behaviour is absolutely reproducible. Since I set up the Icinga 2 clusters
with Ansible/Vagrant, it isn't a problem to delete the whole environment and
rebuild it from scratch, and I did that several times without any change in
behaviour: as soon as the second satellite is started the metric rate
skyrockets, with no other symptoms (even the measured values look absolutely
OK). No errors or warnings whatsoever appear in the 'normal' Icinga 2 log at
any time.
Shutting down a master node has no effect on the phenomenon.
Since I don't have much experience with debugging Icinga 2 HA issues, I can't
tell much from the debug logs. If it helps, I can provide debug logs from all
four nodes, sampled during a test with several startup/shutdown cycles of the
satellites. The debug log of the primary master shows that the metric write
rate increases dramatically as soon as both satellites are up and connected.
The satellite's debug log shows an extremely high frequency of log entries of
the form:
> [2015-12-06 17:00:44 +0000] notice/ApiListener: Relaying 'event::CheckResult'
> message
> [2015-12-06 17:00:44 +0000] notice/ApiListener: Sending message to
> 'icinga2-satellite1.vm.hindenburgring.com'
> [2015-12-06 17:00:44 +0000] notice/JsonRpcConnection: Received
> 'event::CheckResult' message from 'icinga2-satellite1.vm.hindenburgring.com'
> [2015-12-06 17:00:44 +0000] debug/Checkable: No origin or local origin for
> object 'icinga2-satellite1.vm.hindenburgring.com!swap', setting
> icinga2-satellite2.vm.hindenburgring.com as check_source.
> [2015-12-06 17:00:44 +0000] debug/Checkable: command_endpoint found for
> object 'icinga2-satellite1.vm.hindenburgring.com!swap', setting
> icinga2-satellite1.vm.hindenburgring.com as check_source.
These messages appear at a rate of around 350 per second as long as both
satellites are active, and they stop immediately once one of them is shut down.
I'm not sure whether this points to the root cause, but it's definitely
connected.
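In case it's relevant: as the debug log suggests, the checks in question use
command_endpoint. Simplified, the apply rule for 'swap' looks roughly like this
(only a sketch, the real assign rules are more specific):

apply Service "swap" {
  check_command = "swap"
  # execute the check on the monitored endpoint itself
  command_endpoint = host.name
  assign where host.zone == "satellite"
}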
I would be very grateful for any idea that helps me sort out what could be
wrong.
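For reference, the debug logs mentioned above come from the standard debuglog
feature, enabled on each node with:

# enable verbose debug logging, then restart the daemon
icinga2 feature enable debuglog
systemctl restart icinga2    # or 'service icinga2 restart', depending on the OS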
Best regards,
Peter.
_______________________________________________
icinga-users mailing list
[email protected]
https://lists.icinga.org/mailman/listinfo/icinga-users