Hello,
I'm currently evaluating a setup for a customer who has very specific needs in
terms of reliability and security. While setting up a minimal test environment
for a pilot installation I ran into an interesting problem - maybe someone has
seen it before and can give me a hint.
The basic setup consists of two zones, one called 'master' and one called
'satellite'. Each zone contains two endpoints:
object Zone "global" {
global = true
}
object Endpoint "icinga2-master1.vm.hindenburgring.com" {
host = "icinga2-master1.vm.hindenburgring.com"
}
object Endpoint "icinga2-master2.vm.hindenburgring.com" {
host = "icinga2-master2.vm.hindenburgring.com"
}
object Endpoint "icinga2-satellite1.vm.hindenburgring.com" {
host = "icinga2-satellite1.vm.hindenburgring.com"
}
object Endpoint "icinga2-satellite2.vm.hindenburgring.com" {
host = "icinga2-satellite2.vm.hindenburgring.com"
}
object Zone "master" {
endpoints = [ "icinga2-master1.vm.hindenburgring.com",
"icinga2-master2.vm.hindenburgring.com" ]
}
object Zone "satellite" {
parent = "master"
endpoints = [ "icinga2-satellite1.vm.hindenburgring.com",
"icinga2-satellite2.vm.hindenburgring.com" ]
}
In the final setup there will be seven satellite zones and the zones will also
get some command execution nodes, but the problem is reproducible with this
minimal setup.
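For context, the command execution nodes will later be attached as child zones
below their satellite zone (assuming the usual one-zone-per-client layout). A
minimal sketch of one such definition - the 'client1' name is only a
placeholder:

object Endpoint "client1.vm.hindenburgring.com" {
  host = "client1.vm.hindenburgring.com"
}

object Zone "client1.vm.hindenburgring.com" {
  # child zone of a satellite zone; checks are executed on this node
  # via command_endpoint
  parent = "satellite"
  endpoints = [ "client1.vm.hindenburgring.com" ]
}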
Perfdata storage for the environment uses Graphite on a separate cluster,
behind HAProxy and two carbon-relay instances for data replication. This setup
has already been tested and works reliably. Writing perfdata from all Icinga 2
nodes involved also works very reliably; all data are written without any
issues. So far it's perfect (and exactly what the customer requires).
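The graphite feature is enabled on every Icinga 2 node and points at the
HAProxy frontend of that cluster, roughly like this - the host below is only a
placeholder for the actual load balancer address:

object GraphiteWriter "graphite" {
  # all nodes write to the same HAProxy frontend, which hands the metrics
  # to the two carbon-relay instances
  host = "graphite-lb.vm.hindenburgring.com"
  port = 2003
}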
The problem starts when I bring up the second satellite endpoint. Load on the
Graphite cluster increases massively, and the number of updates per second
jumps from about 6-7 to something in the range of 3000(!), with fewer than 350
Whisper files in Graphite. When I enable update logging on Graphite I can see
that the vast majority of the updates come from one of the satellite endpoints:
> [...]
> 06/12/2015 15:35:08 :: wrote 441 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_services_ok.value
> in 0.00660 seconds
> 06/12/2015 15:35:08 :: wrote 442 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.min_execution_time.value
> in 0.00105 seconds
> 06/12/2015 15:35:08 :: wrote 442 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_hosts_flapping.value
> in 0.00077 seconds
> 06/12/2015 15:35:08 :: wrote 442 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_hosts_up.value
> in 0.00419 seconds
> 06/12/2015 15:35:08 :: wrote 443 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_services_unknown.value
> in 0.00075 seconds
> 06/12/2015 15:35:08 :: wrote 443 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.avg_latency.value
> in 0.00070 seconds
> 06/12/2015 15:35:08 :: wrote 443 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.max_execution_time.value
> in 0.00063 seconds
> 06/12/2015 15:35:08 :: wrote 443 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_services_pending.value
> in 0.00075 seconds
> 06/12/2015 15:35:08 :: wrote 443 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_services_warning.value
> in 0.00066 seconds
> [...]
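For reference, the update logging quoted above is carbon-cache's standard
option in carbon.conf (stock layout assumed):

[cache]
LOG_UPDATES = True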
The extreme load drops to normal values immediately once I shut down either of
the satellites - it doesn't have to be the one that's doing the extreme amount
of perfdata writing.
The behaviour is absolutely reproducible. Since I set up the Icinga 2 clusters
with Ansible/Vagrant, it isn't a problem to delete the whole environment and
rebuild it from scratch, and I did that several times without any change in
behaviour: as soon as the second satellite is started the metric rate
skyrockets, with no other symptoms (even the measured values look absolutely
OK). No errors or warnings whatsoever appear in the 'normal' Icinga 2 log at
any time.
Shutting down a master node has no effect on the phenomenon.
Since I don't have much experience with debugging Icinga 2 HA issues, I can't
tell much from the debug logs. If it helps, I can provide debug logs from all
four nodes, sampled during a test with several startup/shutdown cycles of the
satellites. The debug log of the primary master shows that the metric write
rate increases dramatically as soon as both satellites are up and connected.
The satellite's debug log shows an extremely high frequency of log entries of
the form:
> [2015-12-06 17:00:44 +0000] notice/ApiListener: Relaying 'event::CheckResult'
> message
> [2015-12-06 17:00:44 +0000] notice/ApiListener: Sending message to
> 'icinga2-satellite1.vm.hindenburgring.com'
> [2015-12-06 17:00:44 +0000] notice/JsonRpcConnection: Received
> 'event::CheckResult' message from 'icinga2-satellite1.vm.hindenburgring.com'
> [2015-12-06 17:00:44 +0000] debug/Checkable: No origin or local origin for
> object 'icinga2-satellite1.vm.hindenburgring.com!swap', setting
> icinga2-satellite2.vm.hindenburgring.com as check_source.
> [2015-12-06 17:00:44 +0000] debug/Checkable: command_endpoint found for
> object 'icinga2-satellite1.vm.hindenburgring.com!swap', setting
> icinga2-satellite1.vm.hindenburgring.com as check_source.
These messages appear at a rate of around 350 per second as long as both
satellites are active, and they stop immediately once one of them is shut down.
I'm not sure whether this points to the root cause, but it's definitely
connected.
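In case it's relevant: as the debug log suggests, the checks in question use
command_endpoint. Simplified, the apply rule for 'swap' looks roughly like this
(only a sketch, the real assign rules are more specific):

apply Service "swap" {
  check_command = "swap"
  # execute the check on the monitored endpoint itself
  command_endpoint = host.name
  assign where host.zone == "satellite"
}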
I would be very grateful for any idea that helps me sort out what could be
wrong.
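For reference, the debug logs mentioned above come from the standard debuglog
feature, enabled on each node with:

# enable verbose debug logging, then restart the daemon
icinga2 feature enable debuglog
systemctl restart icinga2    # or 'service icinga2 restart', depending on the OS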
Best regards,
Peter.
_______________________________________________
icinga-users mailing list
[email protected]
https://lists.icinga.org/mailman/listinfo/icinga-users