Thank you, Michael, for the detailed information.
On Mon, Sep 28, 2020 at 4:37 PM Michael Scherer <msche...@redhat.com> wrote:
> Hi,
>
> The intro
> =========
>
> So we are trying to get the tests running on CentOS 8. Upon
> investigation, we had a few tests failing, and thanks to Deepshika's
> work on https://github.com/gluster/project-infrastructure/issues/20,
> we narrowed that down to 6 tests, and I decided to investigate them
> one by one.
>
> I fixed some of them on the infra side, but 3 of the tests were
> failing because of one single bug, and RH people pointed us to a
> kernel issue: https://github.com/gluster/glusterfs/issues/1402
>
> As we are running CentOS 8.2, it should have been fixed, but it
> wasn't. So the question was "why is the kernel not up to date on our
> builder", which will be our epic quest for today.
>
> The builders
> ============
>
> We run our test builds on AWS EC2. We have a few CentOS 8 builders,
> installed from the only public image available at the time, an
> unofficial one from Frontline. While I tend to prefer official images
> (for reasons that will become clear later), this was the easiest way
> to get things running while waiting for the CentOS 8 images that
> would for sure be there "real soon".
>
> We have automated upgrades on the builders via cron, so they should
> be using the latest kernel, and if that is not the case, it should
> just be one reboot away. As we reboot on failure, and as the kernel
> version seldom impacts tests, we are usually good on that point, but
> maybe the CentOS 8 scripts were a bit different. This was a WIP
> after all.
>
> I take the first builder I get, connect to it and check the kernel:
>
> # uname -a
> Linux builder212.int.aws.gluster.org 4.18.0-80.11.2.el8_0.x86_64 #1 SMP
> Tue Sep 24 11:32:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>
> We need a newer one, so I reboot and test again:
>
> # uname -a
> Linux builder212.int.aws.gluster.org 4.18.0-80.11.2.el8_0.x86_64 #1 SMP
> Tue Sep 24 11:32:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>
> Ok, that's a problem. I check the grub configuration: indeed, there
> is no trace of anything but a single kernel. Curiously, there are
> also traces of CentOS 7 on the disk, a point that will be important
> later, and that does not smell good:
>
> # cat /etc/grub.conf
> default=0
> timeout=0
>
> title CentOS Linux 7 (3.10.0-957.1.3.el7.x86_64)
> root (hd0)
> kernel /boot/vmlinuz-3.10.0-957.1.3.el7.x86_64 ro
> root=UUID=f41e390f-835b-4223-a9bb-9b45984ddf8d console=hvc0
> LANG=en_US.UTF-8
> initrd /boot/initramfs-3.10.0-957.1.3.el7.x86_64.img
>
> So the kernel configuration was not changed, which means something
> went wrong.
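>
> As an aside, here is a quick way to compare the running kernel, the
> installed kernels and the boot entries (a minimal sketch; grubby
> ships with EL8, though on a hybrid image like this one its output
> may not match what actually boots):
>
> # uname -r
> # rpm -q kernel-core
> # grubby --info=ALL | grep -E '^(kernel|title)'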
>
> The kernel
> ==========
>
> The Linux kernel is a quite important part of the system and
> requires some special care on upgrade. For example, you have to
> generate an initramfs based on what is on the disk, and you have to
> modify the configuration files for grub/lilo/etc.
>
> While there are efforts to move the mess from grub to the system
> firmware with UEFI and systemd, we are not there yet, so on EL8 this
> is still done the old way, with scripts run after package
> installation (called scriptlets later in this document). Said
> scripts are shipped with the package, and in the case of the kernel,
> the work is done by /bin/kernel-install, as can be seen with the
> command that shows scriptlets:
>
> # rpm -q --scripts kernel-core-4.18.0-193.19.1.el8_2.x86_64
> postinstall scriptlet (using /bin/sh):
>
> if [ `uname -i` == "x86_64" -o `uname -i` == "i386" ] &&
>    [ -f /etc/sysconfig/kernel ]; then
>   /bin/sed -r -i -e 's/^DEFAULTKERNEL=kernel-smp$/DEFAULTKERNEL=kernel/' /etc/sysconfig/kernel || exit $?
> fi
> preuninstall scriptlet (using /bin/sh):
> /bin/kernel-install remove 4.18.0-193.19.1.el8_2.x86_64 /lib/modules/4.18.0-193.19.1.el8_2.x86_64/vmlinuz || exit $?
> if [ -x /usr/sbin/weak-modules ]
> then
>     /usr/sbin/weak-modules --remove-kernel 4.18.0-193.19.1.el8_2.x86_64 || exit $?
> fi
> posttrans scriptlet (using /bin/sh):
> if [ -x /usr/sbin/weak-modules ]
> then
>     /usr/sbin/weak-modules --add-kernel 4.18.0-193.19.1.el8_2.x86_64 || exit $?
> fi
> /bin/kernel-install add 4.18.0-193.19.1.el8_2.x86_64 /lib/modules/4.18.0-193.19.1.el8_2.x86_64/vmlinuz || exit $?
>
> Here we can see that there are 3 shell scriptlets, run in 3
> different phases of the package upgrade:
> - postinstall
> - preuninstall
> - posttrans
>
> postinstall is run after the installation of the package,
> preuninstall is run before the removal, and posttrans is run once
> the whole transaction is finished. See
> https://rpm-packaging-guide.github.io/#triggers-and-scriptlets
>
> The interesting one is posttrans, since that's the one that installs
> the kernel configuration, and either it failed, or it wasn't run.
>
> So to verify that, the next step is to run the command:
>
> /bin/kernel-install add 4.18.0-193.19.1.el8_2.x86_64 /lib/modules/4.18.0-193.19.1.el8_2.x86_64/vmlinuz
>
> From a quick look, this seemed to work fine, and a quick reboot
> later, I confirmed it did: the kernel was now up to date.
>
> While I could have stopped here, I think it is important to find the
> real problem.
>
> On the internals of rpm package upgrades
> ========================================
>
> Since the scriptlet facing the issue is %posttrans, that means the
> transaction, i.e. the complete upgrade, failed. %posttrans is run
> after the transaction, if it was successful, and since it wasn't
> run, that means the transaction wasn't successful.
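>
> One way to check that after the fact is dnf's transaction history
> (a minimal sketch; the transaction id 42 is just an example, take
> the real one from the list output):
>
> # dnf history list
> # dnf history info 42
>
> dnf history info prints a "Return-Code" line, so a failed
> transaction shows up as something other than Success.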
>
> This is usually not something that happens, unless people reboot
> during the upgrade (a seemingly bad idea, but that happened on a
> regular basis in the past on end users' laptops), or if an rpm
> failed to upgrade properly.
>
> While the hypothesis of a reboot during the upgrade was tempting,
> there is no way it could happen on 3 different unused systems
> several times. So I went on to the next step: running yum upgrade
> to check. And while looking at the yum output, this line caught my
> eye:
>
>   Upgrading  : yum-4.2.17-7.el8_2.noarch                      129/526
> Error unpacking rpm package yum-4.2.17-7.el8_2.noarch
>   Upgrading  : python3-dnf-plugins-core-4.0.12-4.el8_2.noarch 130/526
> error: unpacking of archive failed on file /etc/yum/pluginconf.d:
> cpio: File from package already exists as a directory in system
> error: yum-4.2.17-7.el8_2.noarch: install failed
>
> It seems that yum itself failed to be upgraded, which in turn means
> the whole transaction failed, which in turn means that:
> - the %posttrans scriptlet was not run (so no bootloader config
>   change)
> - yum would try to upgrade again in the future (and fail again)
> - all other packages would be on the disk, so the kernel is
>   installed, just not used at boot
>
> The error "File from package already exists as a directory in
> system" is quite specific, and to explain the issue, we must look at
> how rpm upgrades files.
>
> The naive way of doing package upgrades is to first remove the
> files, and then add the new ones. But this is not exactly a good
> idea if there is an issue during the upgrade, so rpm first adds the
> new files, and later removes what needs to be removed. This way, you
> do not lose files in case of problems such as reboots, hardware
> errors, etc.
>
> However, this scheme fails in one very specific case: replacing a
> directory with a symlink/file, since you cannot create the symlink
> while the directory is still there in the first place (especially if
> there is something in the directory). There are various workarounds,
> see
> https://docs.fedoraproject.org/en-US/packaging-guidelines/Directory_Replacement/
>
> But this is a long-standing, well-known limitation of various
> packaging systems.
>
> The symptoms are that error message, and finding the various files
> with a random suffix on disk:
>
> # ls /etc/yum
> fssnap.d
> pluginconf.d
> pluginconf.d.bak
> 'pluginconf.d;5e4ba094'
> 'pluginconf.d;5f719e89'
> 'pluginconf.d;5f71a052'
> protected.d
> protected.d.bak
> 'protected.d;5f71a160'
> 'protected.d;5f71a18f'
> 'protected.d;5f71a1be'
> vars
> 'vars;5f71a8c8'
> version-groups.conf
> yum7
>
> So one way to fix it is to get the directory out of the way; another
> is to fix the package (cf. the packaging doc). I decided to fix it
> by moving /etc/yum/pluginconf.d out of the way, and the upgrade then
> failed again on protected.d and vars, which got the same treatment
> (see the sketch below).
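>
> In full, the manual fix boils down to this (a rough sketch; the
> .bak names are arbitrary, they just match what you can see in the ls
> output above):
>
> # mv /etc/yum/pluginconf.d /etc/yum/pluginconf.d.bak
> # mv /etc/yum/protected.d /etc/yum/protected.d.bak
> # mv /etc/yum/vars /etc/yum/vars.bak
> # dnf upgrade
>
> After that, the package can install its symlinks, and anything
> useful left in the .bak directories can be merged back by hand.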
>
> So once that's done, yum can be upgraded, and so the system should
> work, right?
>
> But as we are diving deep, why is there an upgrade issue in the
> first place? Where does this yum package come from? Why has no one
> reported such a glaring problem?
>
> A quick look on a fresh VM shows that it indeed comes from CentOS:
>
> [centos@ip-172-31-31-150 ~]$ rpm -qi yum
> Name        : yum
> Version     : 4.0.9.2
> Release     : 5.el8
> Architecture: noarch
> Install Date: Sat 26 Oct 2019 04:44:11 AM UTC
> Group       : Unspecified
> Size        : 60284
> License     : GPLv2+ and GPLv2 and GPL
> Signature   : RSA/SHA256, Tue 02 Jul 2019 02:09:17 AM UTC, Key ID 05b555b38483c65d
> Source RPM  : dnf-4.0.9.2-5.el8.src.rpm
> Build Date  : Mon 13 May 2019 07:35:13 PM UTC
> Build Host  : ppc64le-01.mbox.centos.org
> Relocations : (not relocatable)
> Packager    : CentOS Buildsys <b...@centos.org>
> Vendor      : CentOS
> URL         : https://github.com/rpm-software-management/dnf
> Summary     : Package manager
> Description :
> Utility that allows users to manage packages on their systems.
> It supports RPMs, modules and comps groups & environments.
>
> It looks like a legit rpm; there goes my hope of just being able to
> blame some bad rpm. But that's still odd.
>
> On why you shouldn't use non-official images
> ============================================
>
> As I said, one weird thing about that image was the fact that CentOS
> 7 was in the grub 1 configuration, but also the fact that a grub 1
> config was around at all. Investigating, I found that several files
> in /etc/yum were not owned by rpm, which is rather curious, because
> that would mean some custom change on the image.
>
> For example, what is that yum7 directory:
>
> # rpm -qf /etc/yum/yum7/
> file /etc/yum/yum7 is not owned by any package
>
> A quick search on github led me to this:
> https://github.com/johnj/centos-to8-upgrade/blob/master/to8.sh#L86
>
> So it seems that the image was created by taking a CentOS 7 image,
> upgrading it in place to CentOS 8, and then uploading it, hence the
> leftover CentOS 7 files. But still, the rpm is signed, so I need to
> verify that.
>
> I found the original package on
> http://vault.centos.org/8.0.1905/BaseOS/x86_64/os/Packages/yum-4.0.9.2-5.el8.noarch.rpm
>
> After a quick download and extraction:
>
> $ curl http://vault.centos.org/8.0.1905/BaseOS/x86_64/os/Packages/yum-4.0.9.2-5.el8.noarch.rpm | rpm2cpio | cpio -id
>
> we can see:
>
> $ ls -l etc/yum
> total 0
> lrwxrwxrwx. 1 misc misc 14 Sep 28 12:36 pluginconf.d -> ../dnf/plugins
> lrwxrwxrwx. 1 misc misc 18 Sep 28 12:36 protected.d -> ../dnf/protected.d
> lrwxrwxrwx. 1 misc misc 11 Sep 28 12:36 vars -> ../dnf/vars
>
> So on a freshly installed CentOS 8 system, the symlinks should be
> there.
>
> This means that the upgrade issue is a leftover from the CentOS 7 to
> CentOS 8 in-place upgrade operation. On CentOS 7, /etc/yum/vars is a
> directory; on CentOS 8, it is a symlink. The newer version of the
> upgrade script takes that into account (there is a mv), but not the
> one that was used to create the image, and so the yum upgrade in
> place fails, just as it failed when we tried on CentOS 8.
>
> And since that is an unsupported operation, there is no chance of
> having it fixed in CentOS 8 (or RHEL 8 for that matter) by adding
> the right scriptlet to yum.
>
> Conclusion
> ==========
>
> Since this mail is already long, my conclusion would be:
>
> - trust no one, especially unofficial images on the AWS marketplace
>   (see the quick audit sketch below).
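>
> For the record, here are the quick checks I would now run before
> trusting an image (a rough sketch, nothing exhaustive): look for
> leftover packages from a previous major release, and for files in
> packaged directories that no package owns:
>
> # rpm -qa | grep el7
> # for f in /etc/yum/* ; do rpm -qf "$f" >/dev/null 2>&1 || echo "$f"; done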
>
> I would love to add that the truth and the official images are out
> there, but I checked this morning: still not the case.
>
> --
> Michael Scherer / He/Il/Er/Él
> Sysadmin, Community Infrastructure

_______________________________________________
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra