[foreman-dev] Koji builder crash - days after

Lukas Zapletal Mon, 30 Oct 2017 02:22:53 -0700

Hello,

the reason our Koji was out of service last week was a hardware
failure. The instance was respun on a different hypervisor, but because
ephemeral storage was mounted as swap and scratch disks, the OS did not
come up and went into emergency mode. Frankly, I was surprised, because
I expected the system to boot (the root volume was OK). Anyway, lesson
learned.


After several hours of outage, we were able to bring it up by mounting
the volume in a temporary VM, editing /etc/fstab and starting a new
instance. I made some changes: I cleaned up fstab and dropped
everything except the root volume. Everything else is configured in
rc.local now, so the instance should boot on a different machine or
configuration just fine as long as the root volume is /dev/sda1.

I started a new wiki page where we keep this information:

http://projects.theforeman.org/projects/foreman/wiki/KojiSetup

There were voices on IRC to puppetize this server. I am not against
it, so feel free to add this to the todo list. IMHO it does not make
much sense to puppetize the koji setup itself, but things like setting
up ssh keys or basic services could be useful.

The wiki page follows. I recommend reading it on the wiki rather than
here, as there might be updates already:

***

h1. Koji Setup

Our instance is running at AWS EC2 (us-east-1) as an i3.xlarge
instance (4 CPUs, 32 GB RAM, 900 GB SSD NVMe). It is running CentOS 7
from an EBS volume (8 GB). The account is managed by Bryan Kearney. A
few people have access to the instance, including Lukas Zapletal, Eric
Helms and Mike McCune. If you need access, contact them.

h2. Volumes and mounts

The instance has two EBS volumes attached:

* /dev/sda1 - root
* /dev/sdx - data volume (available as /dev/xvdx1, mounted at /mnt/koji)

The instance must be running in a security group with ports 22, 80,
443, 873 (rsyncd) and 44323 (read-only PCP monitoring web app) allowed
(all IPv4 TCP).

The root EBS volume is mounted by UUID in fstab:

<pre>
UUID=29342a0b-e20f-4676-9ecf-dfdf02ef6683 / xfs defaults 0 0
</pre>

Note that the other volumes are not present in fstab. This prevents
booting into emergency mode when the VM is respun on a different
hypervisor with a different or empty ephemeral or EBS storage
configuration. Everything else is mounted in /etc/rc.local:

<pre>
swapon /dev/nvme0n1p1
mount /dev/nvme0n1p2 /mnt/tmp -o defaults,noatime,nodiratime
mount /dev/xvdx1 /mnt/koji -o defaults,noatime,nodiratime
hostnamectl set-hostname koji.katello.org
systemctl restart pmcd pmlogger pmwebd
mount | grep /mnt/koji && systemctl restart rsyncd
mount | grep /mnt/koji && systemctl start postgresql
systemctl start httpd
mount | grep /mnt/koji && mount | grep /mnt/tmp && systemctl start kojid
mount | grep /mnt/koji && mount | grep /mnt/tmp && systemctl start kojira
</pre>
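
To double-check that fstab stays as minimal as described above, a small helper can list every entry other than the root filesystem. This is only a sketch: the function name @fstab_extra_mounts@ is hypothetical and is not something installed on the instance.

```shell
#!/bin/sh
# Hypothetical helper, not part of the instance configuration.
# Print the second field (mount point, or "swap"/"none" for swap) of every
# fstab entry other than the root filesystem.
# Usage: fstab_extra_mounts [FSTAB_FILE]   (defaults to /etc/fstab)
fstab_extra_mounts() {
    awk '
        /^[ \t]*#/ { next }        # skip comment lines
        NF < 2     { next }        # skip blank or malformed lines
        $2 != "/"  { print $2 }    # anything that is not the root entry
    ' "${1:-/etc/fstab}"
}

# Empty output means only the root volume is listed:
# fstab_extra_mounts /etc/fstab
```

Any line printed by this check is an entry that could, in principle, block the boot again if its device is missing on a new hypervisor.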

On our current VM flavour there is local NVMe SSD storage
(/dev/nvme0n1) with two partitions (50/50). The first one is swap and
the second one is mounted as /mnt/tmp, where koji does all its work.
This volume needs to be fast; it grows over time and contains
temporary files (built packages, build logs, support files).

The main data folder, which holds the PostgreSQL database, the koji
generated repositories and the external repositories, is on an EBS
volume mounted as /mnt/koji. Note that it was created as ext4, which
can sometimes lead to an fsck run; perhaps xfs would be a better fit
for our use case.

Services that depend on the data volumes (postgresql, kojid, kojira,
rsyncd) are only started if the required volumes are mounted; httpd is
started unconditionally.
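
The guards in rc.local use @mount | grep /mnt/koji@, which matches substrings (for example a hypothetical /mnt/koji-old) and prints the matched line. A stricter variant could compare the mount point exactly. The sketch below is an assumption about how that might look, not what runs on the instance; the helper names and the @MOUNTS_FILE@ and @SYSTEMCTL@ variables are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the current rc.local.
MOUNTS_FILE=${MOUNTS_FILE:-/proc/self/mounts}   # override for testing
SYSTEMCTL=${SYSTEMCTL:-systemctl}               # set to "echo" for a dry run

# Succeed only if the argument is exactly a mount point listed in MOUNTS_FILE.
is_mounted() {
    awk -v dir="$1" '$2 == dir { found = 1 } END { exit !found }' "$MOUNTS_FILE"
}

# start_if_mounted SERVICE DIR...
# Start SERVICE only when every listed DIR is mounted.
start_if_mounted() {
    service=$1
    shift
    for dir in "$@"; do
        if ! is_mounted "$dir"; then
            echo "not starting $service: $dir is not mounted" >&2
            return 1
        fi
    done
    $SYSTEMCTL start "$service"
}

# Example, mirroring the rc.local lines:
# start_if_mounted postgresql /mnt/koji
# start_if_mounted kojid /mnt/koji /mnt/tmp
```

Setting @SYSTEMCTL=echo@ prints the command instead of running it, which makes it possible to try the guards without touching any service.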

h2. Hostname

The instance has a floating IP; /etc/hosts has an entry for it:

34.224.159.44 koji.katello.org kojihub.katello.org koji kojihub

When the IP changes, make sure to update this entry as well.
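
Updating the entry by hand is simple, but a small function makes it harder to lose one of the aliases. This is a minimal sketch; the function name @update_hosts_ip@ and the address used in the example (203.0.113.10, from the documentation range) are assumptions, not values from the instance.

```shell
#!/bin/sh
# Hypothetical helper, not installed on the instance.
# update_hosts_ip NEW_IP [HOSTS_FILE]
# Replace the address on the line that names koji.katello.org and keep
# all host names and aliases on that line. Other lines are left untouched.
update_hosts_ip() {
    new_ip=$1
    hosts=${2:-/etc/hosts}
    tmp=$(mktemp) || return 1
    if awk -v ip="$new_ip" '
        !/^[ \t]*#/ && NF >= 2 {
            for (i = 2; i <= NF; i++)
                if ($i == "koji.katello.org") { $1 = ip; break }
        }
        { print }
    ' "$hosts" > "$tmp"; then
        cat "$tmp" > "$hosts"    # overwrite in place, keeping owner and mode
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}

# Example (needs root for /etc/hosts):
# update_hosts_ip 203.0.113.10
```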

When a new instance is booted via AWS, it gets a random hostname
assigned. In rc.local we set the hostname to koji.katello.org on
every boot.

h2. Backups

There is a cron job (/etc/cron.weekly/koji-backup) that performs two
actions every week:

Full PostgreSQL database dump into /mnt/koji/backups/postgres.

File system backup of /mnt/tmp (ephemeral storage) into
/mnt/koji/backups/ephemeral. This backup skips all files whose names
end in rpm (these are not needed). It is made with the duplicity tool,
without encryption. The main purpose of this backup is to preserve the
required filesystem structure so koji can be brought up quickly after
a crash. Since the backup mostly contains directories and build logs,
it is not big. To restore it, use:

duplicity restore file:///mnt/koji/backups/ephemeral /mnt/tmp --force --no-encryption

Neither backup has any rotation, so old backups need to be deleted
manually every year. The full backup script looks like:

<pre>
#!/bin/bash
/usr/bin/duplicity --full-if-older-than 1M --no-encryption -vWARNING \
  --exclude '/mnt/tmp/**/*rpm' /mnt/tmp \
  file:///mnt/koji/backups/ephemeral
date=`date +"%Y%m%d"`
filename="/mnt/koji/backups/postgres/koji_${date}.dump"
pg_dump -Fc -f "$filename" -U koji koji
</pre>
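
Since there is no rotation, the yearly cleanup could be scripted along these lines. This is a sketch under assumptions: the function name @list_old_dumps@ and the 365-day threshold are made up, and the commands that actually delete anything are left commented out so the list can be reviewed first.

```shell
#!/bin/sh
# Hypothetical cleanup helper, not part of /etc/cron.weekly/koji-backup.
# list_old_dumps [DIR] [DAYS]
# Print PostgreSQL dumps in DIR last modified more than DAYS days ago.
list_old_dumps() {
    find "${1:-/mnt/koji/backups/postgres}" -maxdepth 1 -type f \
        -name 'koji_*.dump' -mtime +"${2:-365}" -print
}

# Review the list first, then delete, for example:
# list_old_dumps | xargs -r rm -f --
#
# duplicity can prune its own archive; without --force it only lists
# what would be removed:
# duplicity remove-older-than 1Y --force file:///mnt/koji/backups/ephemeral
```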

h2. Updates

We are running CentOS 7 with Koji (1.11) installed from EPEL7 and the
mrepo package installed from Fedora. Since koji in EPEL was bumped to
an incompatible version, we have disabled the EPEL7 repository for now.

Update procedure:

* Shut down the instance.
* Make a snapshot of the root EBS volume.
* Start the instance and check that koji is operating properly.
* Perform the yum upgrade in a screen session. Do not let koji upgrade.
* Reboot and check the koji services.
* Send an announcement and update the History section below.
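
The first four steps above could be sketched with the AWS CLI and yum roughly as follows. The instance and volume IDs are placeholders, the function names are hypothetical, and the exclude pattern assumes that all Koji packages match @koji*@. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch only. Placeholders: replace with the real instance and volume IDs.
INSTANCE_ID=${INSTANCE_ID:-i-0123456789abcdef0}
ROOT_VOLUME_ID=${ROOT_VOLUME_ID:-vol-0123456789abcdef0}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, anything else = run them

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps 1-3: stop the instance, snapshot the root volume, start it again.
snapshot_root_volume() {
    run aws ec2 stop-instances --instance-ids "$INSTANCE_ID"
    run aws ec2 wait instance-stopped --instance-ids "$INSTANCE_ID"
    run aws ec2 create-snapshot --volume-id "$ROOT_VOLUME_ID" \
        --description "koji root before yum upgrade"
    run aws ec2 start-instances --instance-ids "$INSTANCE_ID"
}

# Step 4, to be run on the instance inside a screen session:
# upgrade everything except packages matching koji*.
upgrade_without_koji() {
    run yum upgrade --exclude='koji*'
}
```

Running @snapshot_root_volume@ with the default @DRY_RUN=1@ shows the four AWS commands without executing them, which is a cheap way to check the IDs before the real run.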

If the update fails, shut down the failed VM, create a new AMI from the
snapshot, mount the data EBS volume and spawn a new instance.

h2. Monitoring

There is a PCP daemon (pmcd) running on the instance, and pmlogger is
active, creating archives of performance data in /var/log/pcp/pmlogger
(30-day rotation). It is possible to connect and see live data at
http://koji.katello.org:44323 (a Graphite API is also available at
this URL).

The local postfix instance is not configured yet, so e-mails from cron
are not delivered outside the instance.

h2. History

* Summer 2017 - Instance was installed according to Koji wiki guide.
* 2017/10/27 - Hardware failure. The instance was relaunched on a
different hypervisor, fstab was edited and this page was created.

-- 
Later,
  Lukas @lzap Zapletal
