Hello,
the infra team has met to review the recent issues and would like to
make its analysis public to the community:
I. introduction
We have three big servers running, called dauntless, excelsior and
falco. The first two were put online in October, the last one in
December. The planned platform was oVirt with Gluster, with CentOS as
the base system. All servers are comparable in their hardware setup,
with 256 GB of RAM, one internal and one external Gbit networking card,
server-grade mainboard with IPMI, several HDDs, and 64 core CPU. All
three hosts were connected internally via a Gbit link, with oVirt
managing all of them, and Gluster being the underlying network file system.
II. before going into production
Extensive tests were carried out before going live with the platform.
The exact CentOS, oVirt and Gluster versions used for production were
tested twice, both on separate hardware and on the actual hardware, with
the IPMI and BIOS versions used later; that included two disaster
simulations, where one host was disconnected without warning and oVirt
behaved exactly as expected - running the VMs on the other host without
interruption.
When the platform was ready, to avoid endangering anything, only
non-critical VMs were migrated at first - mainly testing VMs whose
downtime would not be critical, but which still produce considerable
load. The platform ran exclusively with these for several weeks, with no
problems detected.
After that, several other VMs were migrated, including Gerrit, and the
system worked fine for weeks with no I/O issues and no downtime.
III. downtime before FOSDEM
The first issues happened from Wednesday, January 28, around 14:40 UTC,
until Thursday, January 29, early morning. Falco was disconnected for up
to a couple of minutes. The reason for this is still unclear.
- The infra team is looking for an oVirt expert who can help us
parse the logs to better understand what has happened.
At the same time, for an unrelated reason, a file system corruption on
excelsior was discovered - a failure that had already existed since
January 5.
- The monitoring scripts, based on snmpd, claimed everything was
ok. Scripts have already been enhanced to properly report the status.
Each of the errors on their own, if detected in time, would have caused
no downtime at all.
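To illustrate the kind of enhancement meant here, a check that inspects
the filesystem state directly - instead of trusting the snmpd reports
alone - could look like the following sketch. The function name and
wiring are hypothetical, not our actual monitoring scripts.

```shell
#!/bin/sh
# Hedged sketch: a direct check of the kind that would catch a case like
# the excelsior corruption, where the snmpd-based scripts still reported
# everything as ok.

check_ro_mounts() {
    # Read /proc/mounts-format lines on stdin and report mount points
    # whose options include "ro" - the kernel remounts a filesystem
    # read-only when it detects corruption on it.
    awk '$4 ~ /(^|,)ro(,|$)/ { print "READ-ONLY: " $2 }'
}

# Example with canned input; on a real host you would pipe /proc/mounts
# in (and filter out intentionally read-only pseudo-filesystems):
printf '/dev/sda1 / ext4 rw,relatime 0 0\n/dev/sdb1 /data xfs ro,noatime 0 0\n' \
    | check_ro_mounts
# -> READ-ONLY: /data
```

A check like this can be hooked into snmpd via its extend mechanism or
run from cron, alerting whenever it produces any output.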
With that, 2 out of 3 Gluster bricks were down, and the platform was
stopped. (Gluster is comparable to RAID 5 here.) Gluster detected a
possible split-brain situation, so a manual recovery was required.
Starting the actual fix was not complicated, compared to other network
file systems, and could easily be handled, but the recovery took a long
time due to the amount of data already on the platform and the internal
Gbit connectivity. Depending on the VM, the downtime was between 3 and
18 hours. oVirt's database also had issues, which could however be
fixed. In other words, most of the downtime was spent not on finding a
fix, but on waiting for it to complete.
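For reference, a manual Gluster split-brain recovery broadly follows
this pattern on the CLI; the volume name "gv0" and the file paths are
placeholders, and the policy-based split-brain subcommands only exist in
newer Gluster releases:

```shell
# 1. List entries that need healing, and those actually in split-brain:
gluster volume heal gv0 info
gluster volume heal gv0 info split-brain

# 2. Resolve a split-brain file by policy, e.g. keep the bigger copy or
#    declare one brick authoritative (newer Gluster releases only):
gluster volume heal gv0 split-brain bigger-file /path/to/file
gluster volume heal gv0 split-brain source-brick host:/brick/path /path/to/file

# 3. Trigger and monitor the full self-heal; with terabytes of data over
#    a single Gbit link, this step - not the diagnosis - dominates the
#    downtime:
gluster volume heal gv0 full
watch gluster volume heal gv0 info
```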
- Situations like these can be less time-consuming with an
(expensive) internal 10 Gbit connection, or with a (slower, but more
redundant and cheaper) internal trunked/bonded x * 1 Gbit connection,
which we will be looking into.
- Currently in progress is an SMS notification system, for which we
seek volunteers to be included in the notifications. SMS notification
is to be sent out in case of severe issues and can be combined with
Android tools like Alarmbox.
- In the meantime, we have also fine-tuned the alerts and
thresholds to distinguish messages.
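As a sketch of the bonded option mentioned above, such a setup on CentOS
could look roughly like this; the device names, bonding mode and address
are placeholder assumptions, not our actual configuration:

```shell
# Hypothetical /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
# 802.3ad (LACP) needs switch support; mode=balance-alb works without it.
BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer3+4"
IPADDR=10.0.0.11
PREFIX=24
ONBOOT=yes
BOOTPROTO=none

# Hypothetical /etc/sysconfig/network-scripts/ifcfg-eth1 (likewise eth2)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
```

Note that a single TCP stream still tops out at one link's speed;
bonding mainly helps when several bricks or clients transfer in
parallel, which can be the case during a multi-brick self-heal.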
All infra members, together with several volunteers, worked jointly on
getting the services back up. However, we experienced some issues with
individual VMs where it was unclear who was responsible for them, and
where documentation was partially missing or outdated. It all worked out
in the end.
- Infra will enforce a policy for new and existing services. At
least 2 responsible maintainers are required per service, including
proper documentation. That will be announced with a fair deadline.
Services not fulfilling those requirements will be moved from production
to test. A concrete policy is still to be drafted with the public.
On a side note, we discovered that oVirt does not support running
Gluster on the internal interface while the hosted engine and management
run on the external interface - even though this setup had worked fine
for months and survived two disaster simulations. This fact is not
mentioned anywhere in the oVirt documentation; Alex learned of it during
FOSDEM, when it came up as a side note in an oVirt talk.
- An option is to look into SAN solutions, which are not only
faster, but also probably more reliable. We might have some supporting
offer here that needs looking into.
During FOSDEM, Alex also got in touch with one oVirt and one Gluster
developer. We also talked to a