My two cents

On 05/16/2017 09:01 AM, Valentin Höbel wrote:
> Dear list,
> 
> I have three questions regarding IDO DB and I hope someone here can point me 
> in the right direction.
> 
> a) IDO DB failover
> ---------------------------------------
> I noticed that there is a failover of the active IDO DB writer when the 
> current IDO DB writer (Icinga2) is powered off.
> After the timeout period of 60 seconds, another Icinga2 master starts to 
> write to the database.
> 
> However, when I kill MariaDB on the server where Icinga2 is currently the IDO 
> DB writer, there is no failover.
> Instead, the Icinga2 instance (which can no longer write to localhost:3306) 
> remains the active IDO DB writer.
> 
> Is this on purpose? It is clear to me why there is a failover when Icinga2 is 
> stopped, but not why there is no failover when Icinga2 can no longer access 
> the database. Looking at the official documentation, I saw no hint as to why 
> Icinga2 behaves this way.

I tested this and also see that it does not fail over when I cut off network 
access to my DB cluster's VIP. However, see below regarding buffering.

This failure scenario would require a serious network issue if your database 
cluster were architected to be highly available (any client node can connect to 
any server node). That's not to say it couldn't be handled better.
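
To confirm from the writer node that the database endpoint really is unreachable while no failover happens, a plain TCP probe is enough. This is only a minimal sketch: the function name `db_reachable` is made up for illustration, and it checks only that something accepts connections on the port, not that MariaDB answers queries.

```python
import socket


def db_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    This only tests that a listener accepts the connection. It does not
    log in to the database or run a query.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

For the setup described in this thread, the call would be `db_reachable("127.0.0.1", 3306)` on the node that is currently the IDO DB writer.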

> 
> 
> b) Buffering during IDO DB downtime
> ---------------------------------------
> Let's say I shut down all database instances while Icinga2 on the masters is 
> still running.
> Icinga2 is now not able to write to the database.
> Let's say I start the database instances again after 10 minutes.
> We had 10 minutes DB downtime.
> 
> - Did Icinga2 buffer/save the history/queries/data from the last
>   10 minutes?
> - If yes, will Icinga2 try to write this historical data back to the database
>   as soon as it is available again?
> - If yes, will Icinga2 throw away obsolete data, e.g. when a fresh check
>   result came in and a service state changed?
> 
> I am asking because I want to use the content of IDO DB for generating reports.
> And if Icinga2 doesn't buffer the DB writes during database downtime, the
> report data will be incomplete.
> 
> I ran a couple of tests and it seems that all data from the database downtime 
> is missing (e.g. check results/service states which are written to IDO DB).

When I cut off access to the database from one Icinga2 master (db writer), 
queries were buffered and resumed once network access returned. In your setup, 
that would equate to the Icinga2 node's local Galera member recovering.

Do you have logs with "notice/WorkQueue" and "Ido*Connection" with "tasks: 
<NUMBER>" in them? When I cut off access to the database, this number increases 
as checks are performed and their queries are buffered. Can you provide the 
equivalent logs from the database downtime you tested? Once access to the 
database is restored, do the queries execute again and drain the queue (the 
tasks number decreases)? You may have to adjust your logging level. See 
below for the failover of the queue itself:
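
To make the "tasks:" numbers easier to compare, something like the following could pull them out of a log file. It is a rough sketch, not an official tool: the regular expression assumes only that a matching line contains "WorkQueue", then an "Ido...Connection" name, then "tasks: <NUMBER>", and the helper names `queue_depths` and `trend` are invented here.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: a relevant line mentions "WorkQueue", then an
# "Ido...Connection" name, then "tasks: <NUMBER>", in that order.
TASKS_RE = re.compile(r"WorkQueue.*Ido\w*Connection.*tasks:\s*(\d+)")


def queue_depths(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (line_number, tasks) for every matching log line."""
    depths = []
    for number, line in enumerate(lines, start=1):
        match = TASKS_RE.search(line)
        if match:
            depths.append((number, int(match.group(1))))
    return depths


def trend(depths: List[Tuple[int, int]]) -> str:
    """Compare the first and last sample to describe the queue."""
    if len(depths) < 2:
        return "not enough samples"
    first, last = depths[0][1], depths[-1][1]
    if last > first:
        return "growing"
    if last < first:
        return "draining"
    return "flat"
```

Feeding it the lines of the log written during the outage (for example `queue_depths(open(path))`, with `path` pointing at wherever your log lives) should show "growing" while the database is unreachable and "draining" once it is back, if buffering works as it did in my test.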

> 
> 
> c) Query queuing
> ---------------------------------------
> - I noticed some query queuing stuff in the Icinga2 code and just wanted to
>   know if Icinga2 uses a queuing mechanism for the database queries.
> - If yes, what happens to the content of the queue when there is a failover
>   of the active IDO DB writer - will the content of the query queue get
>   synchronized to the new IDO DB writer?
> 

I cut off access to the DB and let the task queue on the Icinga2 master node/db 
writer build up to ~1,000 over several minutes. When I manually stopped the 
Icinga2 service, the first queries I saw after the DB failover were for the 
current time (indicating that the older queries queued by the stopped node 
were not being processed). The service stop did take a while, but tasks was 
never > 0 and I don't see any queries with timestamps indicating they came 
from the shut-down node's query queue.

In fairness, this would require a double failure: loss of connectivity to the 
database cluster, followed by the Icinga2 master node/db writer itself failing.
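
As a way to picture what I observed, here is a toy model in which each writer keeps its pending queries in its own process memory. It is purely illustrative and not taken from the Icinga2 code: `ToyWriter`, `simulate` and the query strings are all made up, and the only behaviour it encodes is what the tests above suggest (queued queries drain when the DB returns, and are not handed to the next writer when the node stops).

```python
from collections import deque


class ToyWriter:
    """Hypothetical stand-in for a DB writer with an in-memory queue."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()  # pending queries, held only in this process
        self.written = []     # queries that reached the "database"

    def submit(self, query, db_reachable):
        """Queue a query and flush the queue if the DB is reachable."""
        self.queue.append(query)
        if db_reachable:
            self.drain()

    def drain(self):
        """Write all pending queries in the order they were queued."""
        while self.queue:
            self.written.append(self.queue.popleft())

    def stop(self):
        """Stop the writer; whatever is still queued is discarded."""
        lost = list(self.queue)
        self.queue.clear()
        return lost


def simulate():
    """Writer A buffers during an outage, stops, then B takes over."""
    a = ToyWriter("master-a")
    b = ToyWriter("master-b")
    for t in range(3):
        a.submit(f"result@t{t}", db_reachable=False)
    lost = a.stop()
    b.submit("result@t3", db_reachable=True)
    return a.written, lost, b.written
```

In this model `simulate()` leaves the three queries buffered by the first writer unwritten, and the second writer starts with the current one, which is the pattern the timestamps in my test showed.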

> 
> 
> My test setup:
> ---------------------------------------
> - 3 x master within a master zone
> - couple of satellites within satellite zones
> - Each server hosting an Icinga2 master also contains one MariaDB instance.
>   All 3 nodes are part of a Galera cluster.
> - IDO DB is activated, ofc with HA enabled (should be the default anyway).
>   Icinga2 has 127.0.0.1 as the IDO DB host (yes, this is on purpose, we can't
>   use a floating Galera service IP here).

For what it's worth, and I obviously don't know your restrictions, using 
Pacemaker with a VIP between the three Icinga2 master / Galera servers could 
resolve this (it wouldn't require an additional device or networking if the 
nodes already have connectivity/L2 adjacency).

> 
> Best regards
> Valentin
> 
