Hi Lee,

thanks for taking the time to a) test this stuff and b) write an answer. I appreciate the effort!

> I tested this and also see it does not fail over when I cut off
> network access to my DB cluster's VIP. However, see below regarding
> buffering.

Good to hear that you experienced the same thing.


> This failure scenario would require a serious network issue if your
> database cluster was architected to be highly available (any client
> node can connect to any server node). That's not to say perhaps
> it could not be handled better.

That is not the point, though. It doesn't matter how "bad", "good" or
"unusual" the Galera HA setup is; the point is that Icinga2 doesn't fail over
although it can't access the database. The Galera setup in this case is very
special (and yes, I do have reasons to build it that way, and no, since this
is a customer project I really can't tell you why, sorry).

> When I cut off access to the database from one Icinga2 master (db writer),
> queries were buffered and resumed once network access returned. In your
> setup, that would equate to the Icinga2 node's local Galera member
> recovering.

Ok.

> Do you have logs with "notice/WorkQueue" and "Ido*Connection" with
> "tasks: <NUMBER>" in it? When I cut off access to the database, this
> number increases with the buffered queries as checks are performed, etc.
> Can you provide these similar logs during the database downtime you tested?

Oh yes, I noticed them too. I thought they were indicating a query queue,
but unfortunately I was not able to trace whether they really get executed
once Galera is back online. Instead, I was under the impression that Icinga2
rejects them once the DB is online again.
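
For anyone who wants to reproduce this: assuming the debug log feature is
enabled and the default log location (both may differ on your system),
something like this should surface the counters we are talking about:

    # Follow the work queue counters; "notice/WorkQueue" and "tasks:"
    # are taken from the log lines quoted above.
    tail -f /var/log/icinga2/debug.log | grep -E 'notice/WorkQueue|tasks:'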


> Once access to the database is allowed, do these decrease/queries execute
> again and drain the queue (num tasks decreases)? You may have to adjust
> your logging level. See below for the failover of the queue itself:

You're right, the number did decrease.
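
In case it helps others: as far as I know those notice-level messages sit
below the default main log severity, so I had to enable the debug log
feature first (standard commands):

    # Enable Icinga2's debug log and apply the change.
    icinga2 feature enable debuglog
    systemctl reload icinga2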


> I cut off access to the DB and let the Icinga2 master node/db writer's task
> queue build up to ~1,000 over several minutes.

> For what it's worth, and I obviously don't know your restrictions, using
> Pacemaker with a VIP between the three Icinga2 master / Galera servers
> could resolve this (it wouldn't require an additional device or networking
> if the nodes already have connectivity/L2 adjacency).

Your idea of using Pacemaker or a similar mechanism is not wrong, of course.
Since I can't go into detail in this specific case, you have no option but to
trust me that there is a reason why I am doing it another way :-)

Again, the point should not be that another DB HA design could "solve" an
issue. For me, it is about how Icinga2 behaves in various situations and the
reasons behind that behaviour.

To me, it is some sort of design flaw that Icinga2 didn't fail over in this
specific case, no matter whether one can work around it or not :P


I know I'm repeating myself, but hey: thanks for the effort, I appreciate it!

Best regards,
Valentin


On 16.05.2017 18:43, Lee Clemens wrote:
My two cents

On 05/16/2017 09:01 AM, Valentin Höbel wrote:
Dear list,

I have three questions regarding IDO DB and I hope someone here can point me
in the right direction.

a) IDO DB failover
---------------------------------------
I noticed that there is a failover of the active IDO DB writer when the current 
IDO DB writer (Icinga2) is powered off.
After the timeout period of 60 seconds, another Icinga2 master starts to write 
to the database.

However, when I kill MariaDB on the server where Icinga2 is currently the IDO
DB writer, there is no failover.
Instead, the Icinga2 instance (which can no longer write to localhost:3306)
remains the active IDO DB writer.

Is this something which happens on purpose? For me, it is clear that there is
a failover when Icinga2 is stopped; but it is not clear to me why there is no
failover although Icinga2 can no longer access the database. Looking at the
official documentation I saw no hint why Icinga2 behaves this way.
I tested this and also see it does not fail over when I cut off network
access to my DB cluster's VIP. However, see below regarding buffering.

This failure scenario would require a serious network issue if your database 
cluster was architected to be highly available (any client node can connect to 
any server node). That's not to say perhaps it could not be handled better.
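
For reference, the failover behaviour discussed here is controlled by the HA
attributes of the IdoMysqlConnection object. A minimal sketch (credentials
and values are illustrative, not taken from your setup):

    object IdoMysqlConnection "ido-mysql" {
      host = "127.0.0.1"        // placeholder; your local Galera member
      user = "icinga"           // placeholder credentials
      password = "icinga"
      database = "icinga"

      enable_ha = true          // HA mode: only one endpoint writes at a time
      failover_timeout = 60s    // matches the ~60s takeover described above
    }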


b) Buffering during IDO DB downtime
---------------------------------------
Let's say I shut down all database instances while Icinga2 on the masters is
still running.
Icinga2 is now not able to write to the database.
Let's say I start the database instances again after 10 minutes.
We had 10 minutes of DB downtime.

- Did Icinga2 buffer/save the history/queries/the data from the last
   10 minutes?
- If yes, will Icinga2 try to write this historical data back to the database
   as soon as it is available again?
- If yes, will Icinga2 throw away obsolete data, e.g. when a fresh check result
   came in and a service state changed?

I am asking because I want to use the content of IDO DB for generating reports.
And if Icinga2 doesn't buffer the DB writes during database downtime, the 
report data will be incomplete.

I ran a couple of tests and it seems that all data from the database downtime 
is missing (e.g. check results/service states which are written to IDO DB).
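
I checked for the gap roughly like this (icinga_statehistory and its
state_time column are part of the stock IDO schema; the time window is just
an example):

    -- State history rows inside the simulated downtime window;
    -- for me this range stayed empty after the DB came back.
    SELECT state_time, object_id, state
      FROM icinga_statehistory
     WHERE state_time BETWEEN '2017-05-16 09:00:00'
                          AND '2017-05-16 09:10:00';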
When I cut off access to the database from one Icinga2 master (db writer), 
queries were buffered and resumed once network access returned. In your setup, 
that would equate to the Icinga2 node's local Galera member recovering.

Do you have logs with "notice/WorkQueue" and "Ido*Connection" with "tasks:
<NUMBER>" in it? When I cut off access to the database, this number increases
with the buffered queries as checks are performed, etc. Can you provide these
similar logs during the database downtime you tested? Once access to the
database is allowed, do these decrease/queries execute again and drain the
queue (num tasks decreases)? You may have to adjust your logging level. See
below for the failover of the queue itself:


c) Query queuing
---------------------------------------
- I noticed some query queuing stuff in the Icinga2 code and just wanted to
   know if Icinga2 uses a queuing mechanism for the database queries.
- If yes, what happens to the content of the queue when there is a failover
   of the active IDO DB writer - will the content of the query queue get
   synchronized to the new IDO DB writer?

I cut off access to the DB and let the Icinga2 master node/db writer's task
queue build up to ~1,000 over several minutes. When I manually stopped the
Icinga2 service, the first queries I saw after the DB failover were for the
current time (indicating they were not processing the older queries queued by
the stopped node). The service stop did take a while, but "tasks" never went
above 0 on the new writer, and I don't see any queries with timestamps that
indicate they were from the shutdown node's query queue.

In fairness, this would require a double failure: losing connectivity to the
database cluster and then the Icinga2 master node/db writer node failing as
well.


My test setup:
---------------------------------------
- 3 x master within a master zone
- a couple of satellites within satellite zones
- Each server hosting an Icinga2 master also contains one MariaDB instance.
   All 3 nodes are part of a Galera cluster.
- IDO DB is activated, of course with HA enabled (should be the default anyway).
   Icinga2 has 127.0.0.1 as the IDO DB host (yes, this is on purpose, we can't
   use a floating Galera service IP here).
For what it's worth, and I obviously don't know your restrictions, using 
Pacemaker with a VIP between the three Icinga2 master / Galera servers could 
resolve this (it wouldn't require an additional device or networking if the 
nodes already have connectivity/L2 adjacency).
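
A rough sketch with pcs, in case it is useful (resource name and addresses
are made up; any IPaddr2-style VIP would do):

    # Hypothetical floating IP for the Galera/IDO service; replace ip and
    # cidr_netmask with values from your environment.
    pcs resource create galera-vip ocf:heartbeat:IPaddr2 \
        ip=192.0.2.10 cidr_netmask=24 \
        op monitor interval=10s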

Best regards,
Valentin

--
Valentin Höbel
Senior Consultant IT Infrastructure
mobile 0711-95337077

open*i GmbH
Talstraße 41, 70188 Stuttgart, Germany

Managing Director: Tilo Mey
Stuttgart Local Court, HRB 729287, VAT ID DE264295269
Volksbank Stuttgart EG, BIC VOBADESS, IBAN DE75600901000340001003

_______________________________________________
icinga-users mailing list
icinga-users@lists.icinga.org
https://lists.icinga.org/mailman/listinfo/icinga-users
