My two cents:

On 05/16/2017 09:01 AM, Valentin Höbel wrote:
> Dear list,
>
> I have three questions regarding IDO DB and I hope someone here can point me
> in the right direction.
>
> a) IDO DB failover
> ---------------------------------------
> I noticed that there is a failover of the active IDO DB writer when the
> current IDO DB writer (Icinga2) is powered off.
> After the timeout period of 60 seconds, another Icinga2 master starts to
> write to the database.
>
> However, when I kill MariaDB on the server where Icinga2 is currently the IDO
> DB writer, there is no failover.
> Instead, the Icinga2 instance (which can no longer write to localhost:3306)
> remains the active IDO DB writer.
>
> Is this something which happens on purpose? For me, it is clear that there is
> a failover when Icinga2 is stopped; but it is not clear to me why there is
> no failover although Icinga2 can no longer access the database. Looking at
> the official documentation I saw no hint why Icinga2 behaves this way.

I tested this and also see that it does not fail over when I cut off network
access to my DB cluster's VIP. However, see below regarding buffering. This
failure scenario would require a serious network issue if your database
cluster is architected to be highly available (any client node can connect to
any server node). That's not to say it couldn't be handled better.

>
>
> b) Buffering during IDO DB downtime
> ---------------------------------------
> Let's say I shut down all database instances while Icinga2 on the masters is
> still running.
> Icinga2 is now not able to write to the database.
> Let's say I start the database instances again after 10 minutes.
> We had 10 minutes DB downtime.
>
> - Did Icinga2 buffer/save the history/queries/the data from the last
>   10 minutes?
> - If yes, will Icinga2 try to write this historical data back to the database
>   as soon as it is available again?
> - If yes, will Icinga2 throw away obsolete data, e.g. when a fresh check
>   result came in and a service state changed?
>
> I am asking because I want to use the content of IDO DB for generating
> reports.
> And if Icinga2 doesn't buffer the DB writes during database downtime, the
> report data will be incomplete.
>
> I ran a couple of tests and it seems that all data from the database downtime
> is missing (e.g. check results/service states which are written to IDO DB).

When I cut off access to the database from one Icinga2 master (db writer),
queries were buffered and resumed once network access returned. In your setup,
that would equate to the Icinga2 node's local Galera member recovering.

Do you have log lines containing "notice/WorkQueue" and "Ido*Connection" with
"tasks: <NUMBER>" in them? When I cut off access to the database, this number
increases with the buffered queries as checks are performed, etc. Can you
provide similar logs from the database downtime you tested?
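
In case it helps with checking those logs, here is a minimal Python sketch that pulls the "tasks:" numbers out of the relevant lines so you can see whether the queue grows and later drains. It relies only on the three substrings mentioned above. The sample lines are made up for illustration (the real log layout may differ), and the names queue_depths and summarize are hypothetical helpers, not part of Icinga2.

```python
import re

# Patterns built only from the substrings mentioned above; the exact
# Icinga2 log layout is an assumption, not taken from real logs.
IDO_RE = re.compile(r"Ido\w*Connection")
TASKS_RE = re.compile(r"tasks:\s*(\d+)")


def queue_depths(lines):
    """Return the 'tasks:' values from IDO WorkQueue log lines, in order."""
    depths = []
    for line in lines:
        # Skip anything that is not a WorkQueue line for an IDO connection.
        if "notice/WorkQueue" not in line or not IDO_RE.search(line):
            continue
        match = TASKS_RE.search(line)
        if match:
            depths.append(int(match.group(1)))
    return depths


def summarize(depths):
    """Report the peak queue depth and whether the queue went back to zero."""
    if not depths:
        return {"peak": 0, "last": 0, "drained": False}
    peak = max(depths)
    return {"peak": peak, "last": depths[-1],
            "drained": peak > 0 and depths[-1] == 0}


# Hypothetical sample lines, for illustration only.
SAMPLE = [
    "[t0] notice/WorkQueue: (IdoMysqlConnection, ido-mysql) tasks: 0",
    "[t1] notice/WorkQueue: (IdoMysqlConnection, ido-mysql) tasks: 250",
    "[t2] notice/ApiListener: unrelated line, tasks: 7",
    "[t3] notice/WorkQueue: (IdoMysqlConnection, ido-mysql) tasks: 1000",
    "[t4] notice/WorkQueue: (IdoMysqlConnection, ido-mysql) tasks: 0",
]

if __name__ == "__main__":
    print(summarize(queue_depths(SAMPLE)))
```

To use it on a real log, pass the open file object instead of SAMPLE, e.g. queue_depths(open(path)), where path is wherever your Icinga2 log is written.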

Once access to the database is restored, do these numbers decrease, i.e. do
the queries execute again and drain the queue? You may have to adjust your
logging level. See below for the failover of the queue itself:

>
>
> c) Query queuing
> ---------------------------------------
> - I noticed some query queuing stuff in the Icinga2 code and just wanted to
>   know if Icinga2 uses a queuing mechanism for the database queries.
> - If yes, what happens to the content of the queue when there is a failover
>   of the active IDO DB writer - will the content of the query queue get
>   synchronized to the new IDO DB writer?
>

I cut off access to the DB and let the task queue on the Icinga2 master
node/db writer build up to ~1,000 over several minutes. When I manually
stopped the Icinga2 service, the first queries I saw after the DB failover
were for the current time (indicating that the older queries queued by the
stopped node were not being processed). The service stop did take a while;
however, tasks was never > 0 and I don't see any queries with timestamps
indicating they came from the shut-down node's query queue.

In fairness, this would require a double failure: loss of connectivity to the
database cluster, followed by the Icinga2 master node/db writer node also
failing.

>
>
> My test setup:
> ---------------------------------------
> - 3 x master within a master zone
> - couple of satellites within satellite zones
> - Each server hosting an Icinga2 master also contains one MariaDB instance.
>   All 3 nodes are part of a Galera cluster.
> - IDO DB is activated, ofc with HA enabled (should be the default anyway).
>   Icinga2 has 127.0.0.1 as the IDO DB host (yes, this is on purpose, we can't
>   use a floating Galera service IP here).

For what it's worth (and I obviously don't know your restrictions), using
Pacemaker with a VIP between the three Icinga2 master / Galera servers could
resolve this. It wouldn't require an additional device or networking if the
nodes already have connectivity/L2 adjacency.

>
> Best regards
> Valentin
>

_______________________________________________
icinga-users mailing list
icinga-users@lists.icinga.org
https://lists.icinga.org/mailman/listinfo/icinga-users