Hello,

the reason our Koji was out of service last week was a hardware failure. The instance was respun on a different hypervisor, but because ephemeral storage was mounted via fstab as swap and scratch disks, the OS did not come up and dropped into emergency mode. Frankly, I was surprised, because I expected the system to boot (the root volume was OK). Anyway, lesson learned.

After several hours of outage, we were able to bring it up by mounting the volume in a temporary VM, editing /etc/fstab and starting a new instance. I made some changes: I cleaned up fstab and dropped everything except the root volume. Everything else is configured in rc.local now, so the instance should boot on a different machine or configuration just fine as long as the root volume is /dev/sda1.

I started a new wiki page where we keep this information:

http://projects.theforeman.org/projects/foreman/wiki/KojiSetup

There were voices on IRC to puppetize this server. I am not against it, so feel free to add this to the todo list. IMHO it does not make much sense to puppetize the koji setup itself, but things like setting up ssh keys or basic services can be useful.

The wiki page follows. I recommend reading it on the wiki rather than here, as there might be updates already:

***

h1. Koji Setup

Our instance is running at AWS EC2 (us-east-1) as an i3.xlarge instance (4 CPUs, 32 GB RAM, 900 GB SSD NVMe). It is running CentOS 7 from an EBS volume (8 GB). The account is managed by Bryan Kearney; a few people have access to the instance, including Lukas Zapletal, Eric Helms and Mike McCune. If you need access, contact them.

h2. Volumes and mounts

The instance has two EBS volumes attached:

* /dev/sda1 - root
* /dev/sdx - data volume (/mnt/koji available as /dev/xvdx1)

The instance must be running in a security group with ports 22, 80, 443, 873 (rsyncd) and 44323 (read-only PCP monitoring web app) allowed (all IPv4 TCP).

The root EBS volume is mounted via UUID in fstab:

<pre>
UUID=29342a0b-e20f-4676-9ecf-dfdf02ef6683 / xfs defaults 0 0
</pre>

Note that the other volumes are not present in fstab. This prevents booting into emergency mode when the VM is respun on a different hypervisor with a different or empty ephemeral or EBS storage configuration.
All the rest is mounted in /etc/rc.local:

<pre>
swapon /dev/nvme0n1p1
mount /dev/nvme0n1p2 /mnt/tmp -o defaults,noatime,nodiratime
mount /dev/xvdx1 /mnt/koji -o defaults,noatime,nodiratime
hostnamectl set-hostname koji.katello.org
systemctl restart pmcd pmlogger pmwebd
mount | grep /mnt/koji && systemctl restart rsyncd
mount | grep /mnt/koji && systemctl start postgresql
systemctl start httpd
mount | grep /mnt/koji && mount | grep /mnt/tmp && systemctl start kojid
mount | grep /mnt/koji && mount | grep /mnt/tmp && systemctl start kojira
</pre>

On our current VM flavour there is local NVMe SSD storage (/dev/nvme0n1) with two partitions (50/50). The first one is swap and the second one is mounted as /mnt/tmp, where koji does all its work. This volume needs to be fast; it grows over time and contains temporary files (built packages, build logs, support files).

The main data folder, which holds the PostgreSQL database, koji-generated repositories and external repositories, is on the EBS volume mounted as /mnt/koji. Note that this was created as ext4, which can sometimes lead to fsck; perhaps xfs would be a better fit for our use case.

Services required for koji (postgresql, httpd, kojid, kojira, rsyncd) are only started if the required volumes are mounted.

h2. Hostname

The instance has a floating IP. In /etc/hosts we have an entry for it:

34.224.159.44 koji.katello.org kojihub.katello.org koji kojihub

When the IP changes, make sure this entry changes as well. When a new instance is booted via AWS, it gets a random hostname assigned. In rc.local we set the hostname to koji.katello.org on every boot.

h2. Backups

There is a cron job (/etc/cron.weekly/koji-backup) that performs two actions every week:

* Full PostgreSQL database dump into /mnt/koji/backups/postgres.
* File system backup of /mnt/tmp (ephemeral storage) into /mnt/koji/backups/ephemeral. This backup skips all files whose names end in rpm (these are not needed). The duplicity tool is used and no encryption is done.
The main purpose of this backup is to preserve the required filesystem structure so koji can be brought up quickly after a crash. Since the backup mostly contains directories and build logs, it is not big. To restore it, use:

duplicity restore file:///mnt/koji/backups/ephemeral /mnt/tmp --force --no-encryption

Neither backup has any rotation, so old backups need to be deleted every year. The full backup script looks like this:

<pre>
#!/bin/bash
/usr/bin/duplicity --full-if-older-than 1M --no-encryption -vWARNING --exclude '/mnt/tmp/**/*rpm' /mnt/tmp file:///mnt/koji/backups/ephemeral
date=`date +"%Y%m%d"`
filename="/mnt/koji/backups/postgres/koji_${date}.dump"
pg_dump -Fc -f "$filename" -U koji koji
</pre>

h2. Updates

We are running CentOS 7 with Koji (1.11) installed from EPEL7 and the mrepo package installed from Fedora. Since koji in EPEL was bumped to an incompatible version, we have disabled the EPEL7 repository for now.

Update procedure:

* Shut down the instance.
* Make a snapshot of the root EBS volume.
* Start the instance and check that koji is operating properly.
* Perform the yum upgrade in a screen session. Do not let koji upgrade.
* Reboot and check the koji services.
* Send an announcement and edit the History section below.

If the update fails, shut down the failed VM, create a new AMI from the snapshot, mount the data EBS volume and spawn a new instance.

h2. Monitoring

There is a PCP daemon (pmcd) running on the instance, and pmlogger is active, creating archives of performance data in /var/log/pcp/pmlogger (30 days rotation). It is possible to connect and see live data at http://koji.katello.org:44323 (a graphite API is also available at this URL).

The local postfix instance is not configured yet, so e-mails from cron are not delivered anywhere.

h4. History

* Summer 2017 - Instance was installed according to the Koji wiki guide.
* 2017/10/27 - Hardware failure. Instance was relaunched on a different hypervisor. fstab was edited and this page was created.
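
Regarding the missing rotation mentioned in the Backups section: below is a minimal sketch of how old PostgreSQL dumps could be pruned automatically instead of by hand. The function name prune_old_dumps and the idea of calling it from the weekly cron script are assumptions for illustration, not part of the current setup. It relies on GNU find, which is what CentOS 7 ships.

```shell
#!/bin/bash
# Hypothetical helper, NOT part of the existing koji-backup script.
# Removes files named koji_*.dump directly inside DIR that were last
# modified more than DAYS days ago (find's "-mtime +N" semantics).
# Each removed file is printed so the cron mail shows what happened.
prune_old_dumps() {
    local dir="$1" days="$2"
    # Refuse to run if the backup directory is missing, for example
    # when /mnt/koji is not mounted.
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    find "$dir" -maxdepth 1 -type f -name 'koji_*.dump' \
        -mtime +"$days" -print -delete
}
```

Called as `prune_old_dumps /mnt/koji/backups/postgres 365` at the end of the weekly script, this would drop dumps older than roughly a year. For the duplicity archive, duplicity's own `remove-older-than` command (for example `duplicity remove-older-than 1Y --force --no-encryption file:///mnt/koji/backups/ephemeral`) should be the natural counterpart, but that is untested against this setup.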

--
Later,
  Lukas @lzap Zapletal