Hi Lee,
thanks for taking the time to a) test this stuff and b) write an
answer. I appreciate the effort!
> I tested this and also see it does not fail over when I cut off
> network access to my DB cluster's VIP. However, see below regarding buffering.
Good to hear that you experienced the same thing.
> This failure scenario would require a serious network issue if your
> database cluster was architected to be highly available (any client
> node can connect to any server node). That's not to say perhaps
> it could not be handled better.
That is not the point. It doesn't matter how "bad", "good" or "unusual"
the Galera HA setup is. The point is that Icinga2 doesn't fail over
even though it can't access the database. The Galera setup in this case is
very special (and yes, I do have reasons to build it that way, and no,
since this is a customer project I can't tell you why, sorry, really).
> When I cut off access to the database from one Icinga2 master (db writer),
> queries were buffered and resumed once network access returned. In your
> setup, that would equate to the Icinga2 node's local Galera member recovering.
Ok.
> Do you have logs with "notice/WorkQueue" and "Ido*Connection" with "tasks:
> <NUMBER>" in it? When I cut off access to the database, this number
> increases with the buffered queries as checks are performed, etc. Can you
> provide similar logs from the database downtime you tested?
Oh yes, I noticed them too. I thought they indicated a query queue, but
unfortunately I was not able to verify whether they really get executed once
Galera is back online. Instead, I was under the impression that Icinga2
rejects them once the DB is online again.
> Once access to the database is allowed, do these decrease/queries execute
> again and drain the queue (num tasks decreases)? You may have to adjust
> your logging level. See below for the failover of the queue itself:
You're right, the number did decrease.
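For the record, this is roughly how I watched the counter (a sketch; it
assumes the default main log at /var/log/icinga2/icinga2.log and that the
log severity is raised so the notice-level WorkQueue lines get written at
all). First the logger, in features-available/mainlog.conf:

    object FileLogger "main-log" {
      severity = "notice"
      path = LogDir + "/icinga2.log"
    }

and then, while the DB is cut off:

    watch -n 5 'grep "tasks:" /var/log/icinga2/icinga2.log | tail -n 5'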
> I cut off access to the DB and let the Icinga2 master node/db writer's task
> queue build up to ~1,000 over several minutes.
> For what it's worth, and I obviously don't know your restrictions, using
> Pacemaker with a VIP between the three Icinga2 master / Galera servers could
> resolve this (it wouldn't require an additional device or networking if the
> nodes already have connectivity/L2 adjacency).
Your idea of using Pacemaker or a similar mechanism is not wrong, of course.
Since I can't go into detail in this specific case, you have no option but to
trust me that there is a reason why I am doing it another way :-)
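(Still, for anyone else who finds this thread without my constraints, a
minimal sketch of what Lee suggests, assuming Pacemaker/Corosync already runs
on the three nodes and using the standard IPaddr2 resource agent; the IP,
netmask and NIC are placeholders:)

    # floating VIP across the three Galera/Icinga2 nodes
    pcs resource create galera-vip ocf:heartbeat:IPaddr2 \
        ip=192.0.2.10 cidr_netmask=24 nic=eth0 \
        op monitor interval=10s

Icinga2's IDO DB host would then point at the VIP instead of 127.0.0.1.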
Again, the point should not be that another DB HA design could "solve" the
issue. For me, it is about how Icinga2 behaves in various situations and the
reasons behind that behaviour. I consider it some sort of design flaw that
Icinga2 didn't fail over in this specific case, no matter whether one can
work around it or not :P
I know I'm repeating myself, but hey: thanks for the effort, I appreciate it!
Best regards,
Valentin
On 16.05.2017 18:43, Lee Clemens wrote:
My two cents
On 05/16/2017 09:01 AM, Valentin Höbel wrote:
Dear list,
I have three questions regarding the IDO DB and I hope someone here can point
me in the right direction.
a) IDO DB failover
---------------------------------------
I noticed that there is a failover of the active IDO DB writer when the current
IDO DB writer (Icinga2) is powered off.
After the timeout period of 60 seconds, another Icinga2 master starts to write
to the database.
However, when I kill MariaDB on the server where Icinga2 is currently the IDO
DB writer, there is no failover.
Instead, the Icinga2 instance (which can no longer write to localhost:3306)
remains the active IDO DB writer.
Is this done on purpose? It is clear to me that there is a failover when
Icinga2 is stopped, but it is not clear to me why there is no failover even
though Icinga2 can no longer access the database. Looking at the official
documentation, I saw no hint as to why Icinga2 behaves this way.
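(In case someone wants to reproduce this: the currently active writer can be
read from the IDO DB itself; a small sketch, assuming the standard IDO schema
where the HA logic tracks the writing endpoint in icinga_programstatus:)

    -- which endpoint is currently writing, and how fresh its heartbeat is
    SELECT endpoint_name, status_update_time
      FROM icinga_programstatus;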
I tested this and also see it does not fail over when I cut off network access
to my DB cluster's VIP. However, see below regarding buffering.
This failure scenario would require a serious network issue if your database
cluster was architected to be highly available (any client node can connect to
any server node). That's not to say perhaps it could not be handled better.
b) Buffering during IDO DB downtime
---------------------------------------
Let's say I shut down all database instances while Icinga2 on the masters is
still running.
Icinga2 is now not able to write to the database.
Let's say I start the database instances again after 10 minutes.
That's 10 minutes of DB downtime.
- Did Icinga2 buffer/save the history/queries/the data from the last
10 minutes?
- If yes, will Icinga2 try to write this historical data back to the database
as soon as it is available again?
- If yes, will Icinga2 throw away obsolete data, e.g. when a fresh check result
comes in and a service state changes?
I am asking because I want to use the content of the IDO DB for generating reports.
And if Icinga2 doesn't buffer the DB writes during database downtime, the
report data will be incomplete.
I ran a couple of tests and it seems that all data from the database downtime
is missing (e.g. check results/service states which are written to IDO DB).
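(This is how I looked for the gap; a rough sketch against the standard IDO
schema, with my downtime window filled in as a placeholder:)

    -- state history rows written during the 10-minute DB downtime
    SELECT COUNT(*)
      FROM icinga_statehistory
     WHERE state_time BETWEEN '2017-05-16 12:00:00'
                          AND '2017-05-16 12:10:00';

For my test window the count was 0, which matches the missing report data.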
When I cut off access to the database from one Icinga2 master (db writer),
queries were buffered and resumed once network access returned. In your setup,
that would equate to the Icinga2 node's local Galera member recovering.
Do you have logs with "notice/WorkQueue" and "Ido*Connection" with "tasks:
<NUMBER>" in it? When I cut off access to the database, this number increases
with the buffered queries as checks are performed, etc. Can you provide
similar logs from the database downtime you tested? Once access to the
database is allowed, do these decrease/queries execute again and drain the
queue (num tasks decreases)? You may have to adjust your logging level. See
below for the failover of the queue itself:
c) Query queuing
---------------------------------------
- I noticed some query queuing stuff in the Icinga2 code and just wanted to
know if Icinga2 uses a queuing mechanism for the database queries.
- If yes, what happens to the content of the queue when there is a failover of
the active IDO DB writer - will the content of the query queue get
synchronized to the new IDO DB writer?
I cut off access to the DB and let the Icinga2 master node/db writer's task
queue build up to ~1,000 over several minutes. When I manually stopped the
Icinga2 service, the first queries I saw after the DB failover were for the
current time (indicating they were not processing the older queries queued by
the stopped node). The service stop did take a while, but tasks was never > 0
and I didn't see any queries with timestamps indicating they were from the
shut-down node's query queue.
In fairness, this would require a double-failure of both connectivity to the
database cluster and then the Icinga2 master node/db writer node also failing.
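(The queue state can also be inspected via the REST API instead of the logs;
a sketch, assuming the api feature is enabled and using the documentation's
example credentials; the exact status fields vary by Icinga2 version:)

    # IDO connection status of the local node, including queue statistics
    curl -k -s -u root:icinga \
        'https://localhost:5665/v1/status/IdoMysqlConnection' \
        | python -m json.tool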
My test setup:
---------------------------------------
- 3 x master within a master zone
- a couple of satellites within satellite zones
- Each server hosting an Icinga2 master also contains one MariaDB instance.
All 3 nodes are part of a Galera cluster.
- IDO DB is activated, of course with HA enabled (should be the default
anyway). Icinga2 has 127.0.0.1 as the IDO DB host (yes, this is on purpose;
we can't use a floating Galera service IP here).
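For completeness, the relevant part of my ido-mysql.conf looks roughly like
this (the password is a placeholder, the rest are the values I mentioned):

    object IdoMysqlConnection "ido-mysql" {
      host = "127.0.0.1"       // local Galera member, on purpose
      port = 3306
      user = "icinga"
      password = "secret"      // placeholder
      database = "icinga"
      enable_ha = true         // HA for the IDO writer
      failover_timeout = 60s   // matches the takeover delay I observed
    }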
For what it's worth, and I obviously don't know your restrictions, using
Pacemaker with a VIP between the three Icinga2 master / Galera servers could
resolve this (it wouldn't require an additional device or networking if the
nodes already have connectivity/L2 adjacency).
Best regards
Valentin
--
Valentin Höbel
Senior Consultant IT Infrastructure
mobile 0711-95337077
open*i GmbH
Talstraße 41, 70188 Stuttgart, Germany
Managing Director: Tilo Mey
Amtsgericht Stuttgart, HRB 729287, VAT ID DE264295269
Volksbank Stuttgart EG, BIC VOBADESS, IBAN DE75600901000340001003
_______________________________________________
icinga-users mailing list
icinga-users@lists.icinga.org
https://lists.icinga.org/mailman/listinfo/icinga-users