Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Fri, May 19, 2017 at 1:46 PM, Chris Murphy wrote:

> FYI the file system folks are discussing this. It is not just a problem with XFS; it can affect ext4 too. And it's far from clear the fs folks have a solution that won't cause worse problems.

OK so this is what I got out of those conversations:

sync() -> write data to disk, write metadata to log
FIFREEZE() -> sync() and write log contents to fs
unmount() -> sync() and write log contents to fs
reboot() -> sync() and reboot

Only on non-journaled file systems does sync() mean write data to disk, write metadata to fs, because there's no log.

sync() only makes the file system crash safe. It doesn't mean the bootloader can find files (configuration, kernel, initramfs) if they are only sync()'d, because the bootloader has no idea how to read the log. And the fs itself isn't up to date because the log is dirty.

The most central blame here goes to the bootloader: specifically, whatever makes changes to the bootloader configuration in a manner that (pretty much) guarantees the bootloader proper (the binary that executes after POST) will not be able to find either the old or new configuration. At the least, if it found the old configuration, it would boot the old kernel and initramfs, which would then cause the journal to be replayed, the file system updated, and on next boot, the new configuration, kernel, and initramfs would get used.

Because the bootloader has a special requirement, since it cannot read dirty logs, the thing making bootloader-related changes needs to make sure that its updates are not merely crash safe, but are actually fully committed to the file system. That requires fsfreeze.

That implicates grub-mkconfig (for GRUB), grubby (not related to GRUB, used on Red Hat, CentOS, Fedora systems), and myriad kernel package scripts that modify bootloader configurations, kernels, and initramfs out in the wild. The first two, grub-mkconfig and grubby, probably represent a fairly good chunk of deployments.

But there's still a bunch of non-Red Hat systems that do not use GRUB and do not use grubby; they depend on the kernel package post-install scripts to make bootloader changes, and that is what would need to do fsfreeze.

Or systemd can help pick up some of the slack and figure out a way to make sure one of three things definitely happens before a reboot: umount, remount-ro, or fsfreeze. Of course, not every distro uses systemd, and so only solving the central problem is a solution on those distros, but in either case that's not systemd's responsibility.

--
Chris Murphy
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel
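To make the freeze/thaw idea in the message above concrete, here is a minimal sketch of what "FIFREEZE followed immediately by FITHAW" could look like from userspace. It is only an illustration, not what any of the tools named above actually do. The helper name freeze_thaw is hypothetical, the ioctl numbers are computed with the asm-generic _IOWR encoding (as used on x86 and ARM; a few architectures encode ioctls differently), and the call needs CAP_SYS_ADMIN and a file system that supports freezing.

```python
import fcntl
import os


def _iowr(type_char, nr, size):
    # asm-generic ioctl encoding: direction (read|write = 3) in bits 30-31,
    # argument size in bits 16-29, type in bits 8-15, number in bits 0-7.
    return (3 << 30) | (size << 16) | (ord(type_char) << 8) | nr


# From <linux/fs.h>: FIFREEZE is _IOWR('X', 119, int), FITHAW is _IOWR('X', 120, int).
FIFREEZE = _iowr("X", 119, 4)
FITHAW = _iowr("X", 120, 4)


def freeze_thaw(mountpoint):
    """Hypothetical helper: freeze the file system at mountpoint, then thaw it.

    The freeze is what pushes the log contents into the file system proper;
    the immediate thaw is there so that nothing stays blocked afterwards.
    """
    fd = os.open(mountpoint, os.O_RDONLY | os.O_DIRECTORY)
    try:
        fcntl.ioctl(fd, FIFREEZE, 0)
        fcntl.ioctl(fd, FITHAW, 0)
    finally:
        os.close(fd)
```

A kernel package script could, under these assumptions, call freeze_thaw("/boot") after it has finished writing the kernel, initramfs, and bootloader configuration.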
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
FYI the file system folks are discussing this. It is not just a problem with XFS; it can affect ext4 too. And it's far from clear the fs folks have a solution that won't cause worse problems.

http://www.spinics.net/lists/linux-fsdevel/msg111058.html

Chris Murphy
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, 10.04.17 20:20, Chris Murphy (li...@colorremedies.com) wrote:

> 4. Systemd for not enforcing limited kill exemption to those running from initramfs, i.e. ignore kill exemption if the program is running other than initramfs.

Well, we are not the police, and we do kill everything by default, even though we have this explicit, privileged opt-out of this. If people misuse it, then I am pretty sure it's on them, not us...

That said, I will subscribe to the request that systemd's shutdown logic should go the safest way possible, and hence I am fine with calling the generic FIFREEZE+FITHAW ioctls one after the other, if that helps, even though I think this is really broken API.

Lennart

--
Lennart Poettering, Red Hat
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, 10.04.17 19:30, Chris Murphy (li...@colorremedies.com) wrote:

> > > Remember, all of this is because there *is* software that does the wrong thing, and it *is* possible for software to hang and be unkillable. It would be good for systemd to do the right thing even in the presence of that kind of software.
> >
> > Yeah, we do what we can.
> >
> > But I seriously doubt FIFREEZE will make things better. It's just going to make shutdowns hang every now and then.
>
> My understanding is freeze isn't ignorable, it's expressly for the use case when the disk has active processes writing and the fs must be made completely consistent, e.g. prior to taking a snapshot. The thaw immediately following freeze would prevent any shutdown hang.
>
> The point of freeze/thaw is it will cause the file system metadata that grub depends on to know where the new grub.cfg is located, to get committed to disk prior to reboot. If some process is still hanging around with an open write, it doesn't really matter.

As mentioned: if you prep a patch that adds FIFREEZE+FITHAW when we remount stuff read-only, then I'd merge it, even though I think the kernel APIs for this are really broken, and it would be much preferable to have a proper API for this, either exposed via the well-understood sync() syscall, or through a new ioctl, if they must.

Lennart

--
Lennart Poettering, Red Hat
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, Apr 10, 2017 at 4:44 AM, Lennart Poettering wrote:

> On Mon, 10.04.17 19:07, Michael Chapman (m...@very.puzzling.org) wrote:
>
> > > So no, "freeze" is not an option. That sounds like a recipe to make shutdown hang. We need a sync() that actually does what is documented and sync the file system properly.
> >
> > sync() is never going to work the way you want it to work. Let's make systemd work correctly for the systems we have today, not some hypothetical system of the future.
>
> It works the way I want on vfat, ext2. The problem you are having is specific to XFS, no?

ext3 and ext4 are dirty also after doing updates; it's just not causing boot failure, but during startup, fsck is fixing things.

Btrfs doesn't complain, but btrfs-debug-tree immediately after the offline update reboot (without mounting), compared to btrfs-debug-tree following a mount (but not booting, reading, or modifying anything), shows considerable changes are made to the file system just due to the mount. So something was left stale, and I'm guessing it was sync() causing things to get stuffed into the log tree, which is then cleaned up at next mount. It's not corruption, it's not even really dirty in Btrfs semantics, but functionally I guess you'd say it was fixing itself back up, per design.

And BTW, this is in the XFS list thread, but it's not merely the grub.cfg that's missing in action. It's a large pile of files including the kernel and initramfs. None of those new files exist yet from the perspective of the bootloader.

> > The filesystem developers have good reasons for sync()'s current behaviour. I can only point out again that the way they've designed it does *not* lose or corrupt data: all synced data is available as soon as the filesystem journals have been flushed. We have to explicitly flush the journals ourselves, one way or another, to ensure that GRUB and other not-fully-Linux-compatible filesystem implementations work correctly.
>
> The data *is* lost from the perspective of a boot loader. And given that /boot is pretty much exclusively about boot loading, that's kinda major.

Right. So let's play the blame game for a sec:

1. The kernel update package is most responsible for the change in boot state. It's changing kernel, modules, initramfs, and the bootloader configuration file. So it could be argued, this is the thing that should do freeze/thaw to make certain the bootloader will still be happy at next boot.

2. Bootloader has no fallback. The bootloader configuration is modified in a non-atomic way. In a sense, we should have bootloader.old and bootloader.new and use preferably the new one but if not found use the old (unmodified) one. At the least, we get a normal boot with the old configuration and kernel, the kernel code cleans up the file system so now the next boot has the updated kernel and bootloader config.

3. Blame the thing that prevents umount and remount-ro: in the example case it's plymouth.

4. Systemd for not enforcing limited kill exemption to those running from initramfs, i.e. ignore kill exemption if the program is running other than initramfs.

5. The OS installer. It might very well be we've passed the point where it's safe for /boot to be a directory on rootfs. If almost anything can someday pin the file system and prevent umount or remount-ro, and thereby make kernel, initramfs, and bootloader config file changes invisible to the bootloader - that's a good reason to separate those files from a pinned file system.

This bug is interesting because all of these are valid to blame. But which is the most convincing? It's sort of difficult. And in the end, it might be that the least to blame is in the best position to just clobber the problem, preventing it from happening for all use cases.

> Note that these weird XFS semantics are not only a problem on systemd btw: they are much worse on sysvinit and other simpler init systems, since they generally don't have the kill/umount/remount/detach loop we have, and don't support transitioning back into the initrd for complete detaching/umounting of the root fs either.
>
> Hence, any claims by the xfs folks that systemd doesn't disassemble things the right way is very wrong: systemd is certainly the one implementation that has a better chance to keep xfs sane than any other...

Yes, I think that assertion made on the XFS list by one developer is unconvincing.

--
Chris Murphy
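Point 2 in the list above (an old and a new bootloader configuration, with the old one as fallback) can be sketched roughly as follows. This is an assumption-laden illustration, not how grubby or grub-mkconfig work: replace_config, pick_config, and the ".old" suffix are hypothetical names. Note also that the write-to-temp, fsync, rename, fsync-directory sequence only gives crash safety as the kernel sees it; per the rest of this thread, a bootloader that cannot replay the journal may still need a freeze/thaw or unmount before it can see either file.

```python
import os
import shutil
import tempfile


def replace_config(path, new_text):
    """Hypothetical writer: keep the previous config as path + '.old',
    then swap the new config into place with a rename."""
    directory = os.path.dirname(os.path.abspath(path))
    if os.path.exists(path):
        # Preserve the known-good config before touching anything.
        shutil.copy2(path, path + ".old")
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".cfg-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(new_text)
            f.flush()
            os.fsync(f.fileno())  # push file data out before the rename
        os.replace(tmp, path)     # readers see either the old or the new file
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    dirfd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dirfd)           # persist the directory entry change
    finally:
        os.close(dirfd)


def pick_config(path):
    """Hypothetical reader side: prefer the new config, fall back to the old."""
    for candidate in (path, path + ".old"):
        if os.path.isfile(candidate):
            return candidate
    return None
```

The design choice here is that the reader never has to parse a half-written file: it either finds the new configuration or falls back to the untouched old one, which is the "normal boot with the old configuration and kernel" outcome described in point 2.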
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, Apr 10, 2017 at 3:04 AM, Lennart Poettering wrote:

> This is specifically the case that happened for Plymouth: the binary probably got updated, hence the process in memory references a deleted file, which blocks the read-only remounting, in which case we can't do anything, and sync and remount.

In my reproduce case, the offline update contained only kernel, kernel-core, and kernel-modules packages. This triggers grubby to do the modification on the grub.cfg which happens to be on /boot/grub2 on XFS. Plymouth was definitely not being updated.

> > Remember, all of this is because there *is* software that does the wrong thing, and it *is* possible for software to hang and be unkillable. It would be good for systemd to do the right thing even in the presence of that kind of software.
>
> Yeah, we do what we can.
>
> But I seriously doubt FIFREEZE will make things better. It's just going to make shutdowns hang every now and then.

My understanding is freeze isn't ignorable, it's expressly for the use case when the disk has active processes writing and the fs must be made completely consistent, e.g. prior to taking a snapshot. The thaw immediately following freeze would prevent any shutdown hang.

The point of freeze/thaw is it will cause the file system metadata that grub depends on to know where the new grub.cfg is located, to get committed to disk prior to reboot. If some process is still hanging around with an open write, it doesn't really matter.

--
Chris Murphy
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, 10 Apr 2017 13:54:27 +0200, Lennart Poettering wrote:

> On Mon, 10.04.17 13:43, Kai Krakow (hurikha...@gmail.com) wrote:
>
> > On Mon, 10 Apr 2017 11:04:45 +0200, Lennart Poettering wrote:
> >
> > > [...]
> > >
> > > Yeah, we do what we can.
> > >
> > > But I seriously doubt FIFREEZE will make things better. It's just going to make shutdowns hang every now and then.
> >
> > It could simply thaw the FS again after freeze to somewhat improve on that. At least everything that should be flushed is now flushed at that point and grub et al should be happy.
> >
> > But I wonder why filesystems do not just flush the journal on remount-ro? It may take a while but I think that can be perfectly expected when remounting ro: At least I would expect that this forces out all pending writes to the filesystem hence flushing the journal.
>
> Well, the remount-ro doesn't succeed in the case this is all about: the plymouth process appears to run off the root fs and keeps the executable pinned, which was deleted because updated, and thus the kernel will refuse the remount. See other mail.

Ah okay, so given that case, a journal flush isn't even attempted, it fails right away. My first idea was that it should flush the journal but can fail anyways. I didn't get that point. Thus my assumption that remount-ro doesn't flush the journal.

> > So a final freeze/thaw cycle is probably the only way to go? As it specifies what is needed here to be compatible with configurations that involve grub on complex filesystems.
>
> A pair of FIFREEZE+FITHAW are likely to work, but it's frickin' ugly (see other mails), and I'd certainly prefer if the fs folks would provide a proper ioctl/syscall for the operation we need. Quite frankly it doesn't appear like a particularly exotic operation, in fact the operation we'd need would probably be run much more often than the operation that FIFREEZE/FITHAW was introduced for...

Yes it's ugly and there should be a proper ioctl/syscall for the exact semantics needed. Usually, working around such missing APIs only results in the needed bits never being implemented. I totally understand your point. ;-)

--
Regards,
Kai

Replies to list-only preferred.
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, 10.04.17 13:43, Kai Krakow (hurikha...@gmail.com) wrote:

> On Mon, 10 Apr 2017 11:04:45 +0200, Lennart Poettering wrote:
>
> > > Remember, all of this is because there *is* software that does the wrong thing, and it *is* possible for software to hang and be unkillable. It would be good for systemd to do the right thing even in the presence of that kind of software.
> >
> > Yeah, we do what we can.
> >
> > But I seriously doubt FIFREEZE will make things better. It's just going to make shutdowns hang every now and then.
>
> It could simply thaw the FS again after freeze to somewhat improve on that. At least everything that should be flushed is now flushed at that point and grub et al should be happy.
>
> But I wonder why filesystems do not just flush the journal on remount-ro? It may take a while but I think that can be perfectly expected when remounting ro: At least I would expect that this forces out all pending writes to the filesystem hence flushing the journal.

Well, the remount-ro doesn't succeed in the case this is all about: the plymouth process appears to run off the root fs and keeps the executable pinned, which was deleted because updated, and thus the kernel will refuse the remount. See other mail.

> So a final freeze/thaw cycle is probably the only way to go? As it specifies what is needed here to be compatible with configurations that involve grub on complex filesystems.

A pair of FIFREEZE+FITHAW are likely to work, but it's frickin' ugly (see other mails), and I'd certainly prefer if the fs folks would provide a proper ioctl/syscall for the operation we need. Quite frankly it doesn't appear like a particularly exotic operation, in fact the operation we'd need would probably be run much more often than the operation that FIFREEZE/FITHAW was introduced for...

Lennart

--
Lennart Poettering, Red Hat
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, 10 Apr 2017 11:04:45 +0200, Lennart Poettering wrote:

> > Remember, all of this is because there *is* software that does the wrong thing, and it *is* possible for software to hang and be unkillable. It would be good for systemd to do the right thing even in the presence of that kind of software.
>
> Yeah, we do what we can.
>
> But I seriously doubt FIFREEZE will make things better. It's just going to make shutdowns hang every now and then.

It could simply thaw the FS again after freeze to somewhat improve on that. At least everything that should be flushed is now flushed at that point and grub et al should be happy.

But I wonder why filesystems do not just flush the journal on remount-ro? It may take a while but I think that can be perfectly expected when remounting ro: At least I would expect that this forces out all pending writes to the filesystem hence flushing the journal.

Though, readonly mounts do not guarantee that the filesystem won't modify the underlying storage device. For example, btrfs can modify the storage even when mounting an unmounted fs in ro mode. It guarantees readonly from user-space perspective - and I think that's totally on par with the specs of "mount -o ro".

So a final freeze/thaw cycle is probably the only way to go? As it specifies what is needed here to be compatible with configurations that involve grub on complex filesystems.

Then, what about underlying cache infrastructures like BBU-supported RAID caches? We had systems that failed on reboot because the BBU was in a relearning cycle at reboot and the controller thus refused to replay the write-cache during POST and instead discarded it. That can really create a big mess, btw. Though, I think that's a controller bug: The writeback wasn't set to always writeback but only when it's safe. But this suggests that the reboot code should even force some cache flush for those components.

Taking everything into account, it boils down to eventually not using grub on XFS but only on simple filesystems, or depending on the ESP only for booting. Everything else only means that systemd (and other init systems) have to invent a huge complex mess to fix everything that isn't done right by other involved software.

--
Regards,
Kai

Replies to list-only preferred.
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, 10.04.17 19:07, Michael Chapman (m...@very.puzzling.org) wrote:

> > So no, "freeze" is not an option. That sounds like a recipe to make shutdown hang. We need a sync() that actually does what is documented and sync the file system properly.
>
> sync() is never going to work the way you want it to work. Let's make systemd work correctly for the systems we have today, not some hypothetical system of the future.

It works the way I want on vfat, ext2. The problem you are having is specific to XFS, no?

> The filesystem developers have good reasons for sync()'s current behaviour. I can only point out again that the way they've designed it does *not* lose or corrupt data: all synced data is available as soon as the filesystem journals have been flushed. We have to explicitly flush the journals ourselves, one way or another, to ensure that GRUB and other not-fully-Linux-compatible filesystem implementations work correctly.

The data *is* lost from the perspective of a boot loader. And given that /boot is pretty much exclusively about boot loading, that's kinda major.

Note that these weird XFS semantics are not only a problem on systemd btw: they are much worse on sysvinit and other simpler init systems, since they generally don't have the kill/umount/remount/detach loop we have, and don't support transitioning back into the initrd for complete detaching/umounting of the root fs either.

Hence, any claims by the xfs folks that systemd doesn't disassemble things the right way is very wrong: systemd is certainly the one implementation that has a better chance to keep xfs sane than any other...

Lennart

--
Lennart Poettering, Red Hat
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, 10 Apr 2017, Lennart Poettering wrote:

> On Mon, 10.04.17 19:38, Michael Chapman (m...@very.puzzling.org) wrote:
> > On Mon, 10 Apr 2017, Lennart Poettering wrote:
> > > On Mon, 10.04.17 18:45, Michael Chapman (m...@very.puzzling.org) wrote:
> > > > On Mon, 10 Apr 2017, Lennart Poettering wrote:
> > > > > On Sun, 09.04.17 10:11, Michael Chapman (m...@very.puzzling.org) wrote:
> > > > > >
> > > > > > Don't forget, they've provided an interface for software to use if it needs more than the guarantees provided by sync. Informally speaking, the FIFREEZE ioctl is intended to place a filesystem into a "fully consistent" state, not just a "fully recoverable" state. (Formally it's all a bit hazy: POSIX really doesn't guarantee anything with sync.)
> > > > >
> > > > > FIFREEZE does considerably more than what you suggest: it also pauses all further changes until FITHAW is called. And that's semantics we really cannot have.
> > > >
> > > > If systemd is just about to call reboot(2), why does it matter?
> > >
> > > Well, in the general case we don't actually call reboot(), because we instead transition back into the initrd, which then eventually calls that. At least that's what happens on the major general purpose distros that have an initrd that does that (for example: Fedora/RHEL with Dracut).
> >
> > If it's not systemd _inside_ the initrd calling reboot(2), then there's nothing systemd can do about it.
>
> The initrd usually doesn't run a systemd environment anymore, PID 1 is usually a shell script of some kind. It might use our "reboot" binary and call it with "-ff" (which means it's really just a pure reboot() wrapper), but we don't do a umount/kill/detach spree in that case. Or in other words: if they do use our reboot utility then they use it in pure sysvinit compat mode, where it won't do more than sync() + reboot(), exactly the same way as sysvinit() did.

OK, given that there's really no point in pursuing this from the systemd end.

> > > I am sorry, but just making all accesses hang is just broken. That can't work.
> > >
> > > > I do think we should attempt to remount readonly before doing the FIFREEZE. I thought systemd did that, but it appears that it does not. A readonly remount will do what we want so long as no remaining processes have any files opened for writing on the filesystem. The FIFREEZE would only be necessary when the remount fails.
> > >
> > > We remount everything read-only we can if we cannot unmount something.
> >
> > Ah, I see the code for that now. I was looking for something after the umount call (specifically, if umount failed), not before.
>
> Well, the scheme works like this: we kill, umount, remount, detach in a loop until nothing changes anymore. It's a primitive but robust way, to deal with stacked storage, where running processes might pin file systems, which might in turn pin devices, which might in turn pin backend userspace services, and so on... Hence, yes, we do the umount first, and the remount second, but then we'll try another umount again and another remount, until this stops being fruitful.
>
> > > But do note that we can't do that in all cases. Most prominently: consider a process that is running from an executable that has been updated on disk (specifically: whose binary got deleted because it was replaced by a newer version). This process will keep the file pinned, and will block all read-only remounts, as the kernel wants to mark the file properly deleted first, but it can't since the process is keeping it pinned.
> > >
> > > This is specifically the case that happened for Plymouth: the binary probably got updated, hence the process in memory references a deleted file, which blocks the read-only remounting, in which case we can't do anything, and sync and remount.
> >
> > OK, so how about this. _After_ the unmount-everything loop we do a freeze + thaw for each remaining filesystem, one filesystem at a time. That won't permanently block processes that are still writing to the filesystems (and why would they be?!), it will ensure that all filesystems' journals are fully flushed (which will make GRUB and other OSs happy), and it won't block the kernel from doing any kind of reboot()-time cleanups you were talking about earlier.
>
> Well, I figure that might work, but it's also fricking ugly: it feels like booking a plane ticket that includes free airplane food, just because you are hungry: you get considerably more than just the food, and you have to sit uncomfortably for too long, end up where you didn't want to go, and the food is quite awful too. ("Chicken or pasta?")
>
> I'd prefer if the file system folks would simply provide sane semantics here, and provide an fsync()-style syscall or ioctl that does what is needed here.

Perhaps we might be able to convince them to make reboot() a full "unmount all remaining filesystems" operation. To be honest, I'm a little surprised it isn't... but I suppose it's got all the same problems with ordering between filesystems within the kernel itself.
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, 10.04.17 19:38, Michael Chapman (m...@very.puzzling.org) wrote:

> On Mon, 10 Apr 2017, Lennart Poettering wrote:
> > On Mon, 10.04.17 18:45, Michael Chapman (m...@very.puzzling.org) wrote:
> > > On Mon, 10 Apr 2017, Lennart Poettering wrote:
> > > > On Sun, 09.04.17 10:11, Michael Chapman (m...@very.puzzling.org) wrote:
> > > > >
> > > > > Don't forget, they've provided an interface for software to use if it needs more than the guarantees provided by sync. Informally speaking, the FIFREEZE ioctl is intended to place a filesystem into a "fully consistent" state, not just a "fully recoverable" state. (Formally it's all a bit hazy: POSIX really doesn't guarantee anything with sync.)
> > > >
> > > > FIFREEZE does considerably more than what you suggest: it also pauses all further changes until FITHAW is called. And that's semantics we really cannot have.
> > >
> > > If systemd is just about to call reboot(2), why does it matter?
> >
> > Well, in the general case we don't actually call reboot(), because we instead transition back into the initrd, which then eventually calls that. At least that's what happens on the major general purpose distros that have an initrd that does that (for example: Fedora/RHEL with Dracut).
>
> If it's not systemd _inside_ the initrd calling reboot(2), then there's nothing systemd can do about it.

The initrd usually doesn't run a systemd environment anymore, PID 1 is usually a shell script of some kind. It might use our "reboot" binary and call it with "-ff" (which means it's really just a pure reboot() wrapper), but we don't do a umount/kill/detach spree in that case. Or in other words: if they do use our reboot utility then they use it in pure sysvinit compat mode, where it won't do more than sync() + reboot(), exactly the same way as sysvinit() did.

> > I am sorry, but just making all accesses hang is just broken. That can't work.
> >
> > > I do think we should attempt to remount readonly before doing the FIFREEZE. I thought systemd did that, but it appears that it does not. A readonly remount will do what we want so long as no remaining processes have any files opened for writing on the filesystem. The FIFREEZE would only be necessary when the remount fails.
> >
> > We remount everything read-only we can if we cannot unmount something.
>
> Ah, I see the code for that now. I was looking for something after the umount call (specifically, if umount failed), not before.

Well, the scheme works like this: we kill, umount, remount, detach in a loop until nothing changes anymore. It's a primitive but robust way, to deal with stacked storage, where running processes might pin file systems, which might in turn pin devices, which might in turn pin backend userspace services, and so on... Hence, yes, we do the umount first, and the remount second, but then we'll try another umount again and another remount, until this stops being fruitful.

> > But do note that we can't do that in all cases. Most prominently: consider a process that is running from an executable that has been updated on disk (specifically: whose binary got deleted because it was replaced by a newer version). This process will keep the file pinned, and will block all read-only remounts, as the kernel wants to mark the file properly deleted first, but it can't since the process is keeping it pinned.
> >
> > This is specifically the case that happened for Plymouth: the binary probably got updated, hence the process in memory references a deleted file, which blocks the read-only remounting, in which case we can't do anything, and sync and remount.
>
> OK, so how about this. _After_ the unmount-everything loop we do a freeze + thaw for each remaining filesystem, one filesystem at a time. That won't permanently block processes that are still writing to the filesystems (and why would they be?!), it will ensure that all filesystems' journals are fully flushed (which will make GRUB and other OSs happy), and it won't block the kernel from doing any kind of reboot()-time cleanups you were talking about earlier.

Well, I figure that might work, but it's also fricking ugly: it feels like booking a plane ticket that includes free airplane food, just because you are hungry: you get considerably more than just the food, and you have to sit uncomfortably for too long, end up where you didn't want to go, and the food is quite awful too. ("Chicken or pasta?")

I'd prefer if the file system folks would simply provide sane semantics here, and provide an fsync()-style syscall or ioctl that does what is needed here.

Lennart

--
Lennart Poettering, Red Hat
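The "loop until nothing changes anymore" scheme described in the message above can be illustrated with a small toy model. This is a sketch only and not systemd's shutdown code: run_until_stable and ToySystem are hypothetical names, and the model covers just two of the four steps (umount and detach), with one loop device whose backing file lives on another mounted file system.

```python
def run_until_stable(steps, max_rounds=100):
    """Call every step once per round; stop after the first round in which
    no step reports a change. Returns the number of rounds run."""
    for round_no in range(1, max_rounds + 1):
        # Build a list so that every step runs, even if an earlier one changed something.
        results = [step() for step in steps]
        if not any(results):
            return round_no
    raise RuntimeError("teardown did not converge")


class ToySystem:
    """Toy stacked storage: /inner is mounted from loop0, and loop0's
    backing file lives on /outer, so /outer is pinned until loop0 is detached."""

    def __init__(self):
        self.mounts = {"/inner": "loop0", "/outer": "sda1"}  # mount point -> device
        self.loops = {"loop0": "/outer"}  # loop device -> mount holding its backing file

    def umount_all(self):
        pinned = set(self.loops.values())
        doomed = [m for m in self.mounts if m not in pinned]
        for m in doomed:
            del self.mounts[m]
        return bool(doomed)

    def detach_loops(self):
        in_use = set(self.mounts.values())
        idle = [d for d in self.loops if d not in in_use]
        for d in idle:
            del self.loops[d]
        return bool(idle)
```

In this model the first round unmounts /inner and detaches loop0, the second round can then unmount /outer, and the third round changes nothing and ends the loop. Whatever is still mounted once the loop stops is what the freeze + thaw proposal quoted above would then be applied to.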
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, 10 Apr 2017, Lennart Poettering wrote:

> On Mon, 10.04.17 17:21, Michael Chapman (m...@very.puzzling.org) wrote:
>
> > > Or, I think, when pivoting back to the shutdown-initramfs. (Though then you also need the shutdown-initramfs to run `fsfreeze`, I guess?)
> >
> > No, I don't think it should be done then. If a filesystem is still in use, then doing a freeze there would likely make any processes still using it unkillable. And doing a freeze followed by a thaw doesn't gain us much, we'd still need to do another freeze at the end of shutdown-initramfs.
>
> Hmm? Are you saying that on XFS you might even see corruption on files that weren't accessed for write since the last freeze if you forget to freeze when shutting down?

No, I'm not saying that at all.

> I mean, unless the initrd hooks modify the boot loader having done FIFREEZE once sounds safe enough, no?

Mantas Mikulėnas suggested doing a freeze on the pivot back to the shutdown-initramfs. But that's no good: any argv[0][0] == '@' processes could be writing to the filesystems then. This is why I've been stressing that the filesystem freezes (and thaws too, if necessary) should only happen right before the reboot(2) syscall.
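For readers unfamiliar with the argv[0][0] == '@' convention mentioned above, here is a minimal sketch of just that test. The function names are hypothetical, and it deliberately leaves out the other conditions systemd's real check may apply (for example around process ownership), so treat it as an illustration of the marker only.

```python
def is_kill_exempt(cmdline: bytes) -> bool:
    """Hypothetical check: does argv[0] start with '@'?

    cmdline is the raw content of /proc/<pid>/cmdline, i.e. the argv
    entries separated by NUL bytes. An empty cmdline (as kernel threads
    have) is never treated as exempt here.
    """
    return cmdline[:1] == b"@"


def read_cmdline(pid: int) -> bytes:
    # Linux-specific: /proc/<pid>/cmdline holds the NUL-separated argv.
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        return f.read()
```

A process marked this way survives the killing spree, which is why it may still be writing to a file system at the point of the pivot back to the shutdown-initramfs.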
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, 10 Apr 2017, Lennart Poettering wrote: On Mon, 10.04.17 18:45, Michael Chapman (m...@very.puzzling.org) wrote: On Mon, 10 Apr 2017, Lennart Poettering wrote: On Sun, 09.04.17 10:11, Michael Chapman (m...@very.puzzling.org) wrote: Don't forget, they've provided an interface for software to use if it needs more than the guarantees provided by sync. Informally speaking, the FIFREEZE ioctl is intended to place a filesystem into a "fully consistent" state, not just a "fully recoverable" state. (Formally it's all a bit hazy: POSIX really doesn't guarantee anything with sync.) FIFREEZE does considerably more than what you suggest: it also pauses all further changes until FITHAW is called. And that's semantics we really cannot have. If systemd is just about to call reboot(2), why does it matter? Well, in the general case we don't actually call reboot(), because we instead transition back into the initrd, which then eventually calls that. At least that's what happens on the major general purpose distros that have an initrd that does that (for example: Fedora/RHEL with Dracut). If it's not systemd _inside_ the initrd calling reboot(2), then there's nothing systemd can do about it. Moreover, on the kernel side, various bits and pieces hook into the reboot() syscall too and do last-minute stuff before going down. Are you sure that if you have a complex storage setup (let's say DM on top of loop on top of XFS on top of something else), that having frozen a lower-level file system is not going to make the kernel itself pretty unhappy if it then tries to clean up something further above? OK, that is a good point. I am sorry, but just making all accesses hang is just broken. That can't work. I do think we should attempt to remount readonly before doing the FIFREEZE. I thought systemd did that, but it appears that it does not. A readonly remount will do what we want so long as no remaining processes have any files opened for writing on the filesystem. 
The FIFREEZE would only be necessary when the remount fails. We remount everything read-only we can if we cannot unmount something. Ah, I see the code for that now. I was looking for something after the umount call (specifically, if umount failed), not before. But do note that we can't do that in all cases. Most prominently: consider a process that is running from an executable that has been updated on disk (specifically: whose binary got deleted because it was replaced by a newer version). This process will keep the file pinned, and will block all read-only remounts, as the kernel wants to mark the file properly deleted first, but it can't since the process is keeping it pinned. This is specifically the case that happened for Plymouth: the binary probably got updated, hence the process in memory references a deleted file, which blocks the read-only remounting, in which case we can't do anything, and sync and remount. OK, so how about this. _After_ the unmount-everything loop we do a freeze + thaw for each remaining filesystem, one filesystem at a time. That won't permanently block processes that are still writing to the filesystems (and why would they be?!), it will ensure that all filesystems' journals are fully flushed (which will make GRUB and other OSs happy), and it won't block the kernel from doing any kind of reboot()-time cleanups you were talking about earlier. Note that systemd itself always reexecutes itself on shutdown, to ensure that if itself got updated during runtime we'll stop pinning the old file. Remember, all of this is because there *is* software that does the wrong thing, and it *is* possible for software to hang and be unkillable. It would be good for systemd to do the right thing even in the presence of that kind of software. Yeah, we do what we can. But I seriously doubt FIFREEZE will make things better. It's just going to make shutdowns hang every now and then. 
To be honest, I think having systems unbootable is a more serious problem than having shutdowns hang. But I also think with a freeze _and_ a thaw for each filesystem, we won't have hangs.
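For illustration, the freeze-then-thaw step proposed above could look roughly like the following. This is only a sketch in Python, not systemd code: the helper names `_iowr` and `freeze_then_thaw` are made up here, the request numbers are derived from the `_IOWR('X', 119, int)` and `_IOWR('X', 120, int)` definitions in `<linux/fs.h>` under the assumption of the generic ioctl encoding used on x86 and ARM, and the caller needs CAP_SYS_ADMIN.

```python
import fcntl
import os


def _iowr(type_char, nr, size):
    """Encode an ioctl request number (assumes the asm-generic layout, e.g. x86/ARM)."""
    ioc_read_write = 3  # _IOC_READ | _IOC_WRITE
    return (ioc_read_write << 30) | (size << 16) | (ord(type_char) << 8) | nr


# <linux/fs.h>: FIFREEZE is _IOWR('X', 119, int), FITHAW is _IOWR('X', 120, int).
FIFREEZE = _iowr("X", 119, 4)
FITHAW = _iowr("X", 120, 4)


def freeze_then_thaw(mountpoint):
    """Freeze one filesystem, then thaw it again straight away.

    Hypothetical helper: raises OSError if the path cannot be opened, the
    caller lacks privileges, or the filesystem does not support freezing.
    """
    fd = os.open(mountpoint, os.O_RDONLY | os.O_DIRECTORY)
    try:
        fcntl.ioctl(fd, FIFREEZE, 0)  # journal is flushed into the filesystem
        fcntl.ioctl(fd, FITHAW, 0)    # writers blocked by the freeze resume
    finally:
        os.close(fd)
```

Calling it once per remaining mount point, after the unmount-everything loop, would match the "one filesystem at a time" ordering described in the message.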
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, 10.04.17 18:45, Michael Chapman (m...@very.puzzling.org) wrote: > On Mon, 10 Apr 2017, Lennart Poettering wrote: > > On Sun, 09.04.17 10:11, Michael Chapman (m...@very.puzzling.org) wrote: > > > > > Don't forget, they've provided an interface for software to use if it > > > needs > > > more than the guarantees provided by sync. Informally speaking, the > > > FIFREEZE > > > ioctl is intended to place a filesystem into a "fully consistent" state, > > > not > > > just a "fully recoverable" state. (Formally it's all a bit hazy: POSIX > > > really doesn't guarantee anything with sync.) > > > > FIFREEZE does considerably more than what you suggest: it also pauses > > all further changes until FITHAW is called. And that's semantics we > > really cannot have. > > If systemd is just about to call reboot(2), why does it matter? Well, in the general case we don't actually call reboot(), because we instead transition back into the initrd, which then eventually calls that. At least that's what happens on the major general purpose distros that have an initrd that does that (for example: Fedora/RHEL with Dracut). Moreover, on the kernel side, various bits and pieces hook into the reboot() syscall too and do last-minute stuff before going down. Are you sure that if you have a complex storage setup (let's say DM on top of loop on top of XFS on top of something else), that having frozen a lower-level file system is not going to make the kernel itself pretty unhappy if it then tries to clean up something further above? I am sorry, but just making all accesses hang is just broken. That can't work. > I do think we should attempt to remount readonly before doing the FIFREEZE. > I thought systemd did that, but it appears that it does not. A readonly > remount will do what we want so long as no remaining processes have any > files opened for writing on the filesystem. The FIFREEZE would only be > necessary when the remount fails. 
We remount everything read-only we can if we cannot unmount something. But do note that we can't do that in all cases. Most prominently: consider a process that is running from an executable that has been updated on disk (specifically: whose binary got deleted because it was replaced by a newer version). This process will keep the file pinned, and will block all read-only remounts, as the kernel wants to mark the file properly deleted first, but it can't since the process is keeping it pinned. This is specifically the case that happened for Plymouth: the binary probably got updated, hence the process in memory references a deleted file, which blocks the read-only remounting, in which case we can't do anything, and sync and remount. Note that systemd itself always reexecutes itself on shutdown, to ensure that if itself got updated during runtime we'll stop pinning the old file. > Remember, all of this is because there *is* software that does the wrong > thing, and it *is* possible for software to hang and be unkillable. It would > be good for systemd to do the right thing even in the presence of that kind > of software. Yeah, we do what we can. But I seriously doubt FIFREEZE will make things better. It's just going to make shutdowns hang every now and then. Lennart -- Lennart Poettering, Red Hat ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, 10.04.17 17:21, Michael Chapman (m...@very.puzzling.org) wrote:
> > Or, I think, when pivoting back to the shutdown-initramfs. (Though then you also need the shutdown-initramfs to run `fsfreeze`, I guess?)
>
> No, I don't think it should be done then. If a filesystem is still in use, then doing a freeze there would likely make any processes still using it unkillable. And doing a freeze followed by a thaw doesn't gain us much, we'd still need to do another freeze at the end of shutdown-initramfs.

Hmm? Are you saying that on XFS you might even see corruption on files that weren't accessed for write since the last freeze if you forget to freeze when shutting down?

I mean, unless the initrd hooks modify the boot loader, having done FIFREEZE once sounds safe enough, no?

Lennart

-- 
Lennart Poettering, Red Hat
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, 10 Apr 2017, Lennart Poettering wrote:
> On Mon, 10.04.17 16:14, Michael Chapman (m...@very.puzzling.org) wrote:
> > On Mon, 10 Apr 2017, Chris Murphy wrote:
> > > On Sun, Apr 9, 2017 at 5:17 AM, Lennart Poettering wrote:
> > > > That said, are you sure FIFREEZE is really what we want there? It appears to also pause any further writes to disk (until FITHAW is called).
> > > >
> > > > So, I am still puzzled why the file system people think that "sync()" isn't supposed to actually sync things to disk...
> > >
> > > https://www.spinics.net/lists/linux-xfs/msg05113.html
> >
> > Ah good, Dave actually suggests using a freeze there. A freeze without a corresponding thaw should be OK if it's definitely after all processes have been killed, since we're just about to reboot anyway. (Obviously we'd want to avoid the whole lot when running in a container or when doing kexec.)
>
> No, there is no such guarantee. We support initrds that run userspace stuff from the initrd at boot, that stays around in the background and is only killed after we transition back into the initrd. And we really don't control what they do, they can do anything they like, access any file they want at any time. We added this primarily to support storage services backing the root file system (think iscsid, nbd, ...), but it actually can be anything that has the "feel" of a kernel component in being around since the time before systemd initializes until after the time it shuts down again, but is actually implemented in userspace. In fact, this is precisely what plymouth is making use of: by marking a process with argv[0][0] = '@' we permit any privileged process to be excluded from the final killing spree, so that it will survive until the initrd shutdown transition.

This is precisely why I intend to add it _just before_ the reboot(2) call. Any processes that have survived that far are going to stop running a very short moment later anyway; it doesn't matter if they get hung on a write.

Note that I am specifically NOT talking about doing a filesystem freeze on the shutdown-initrd transition. That would be ludicrous.

> So no, "freeze" is not an option. That sounds like a recipe to make shutdown hang. We need a sync() that actually does what is documented and syncs the file system properly.

sync() is never going to work the way you want it to work. Let's make systemd work correctly for the systems we have today, not some hypothetical system of the future.

The filesystem developers have good reasons for sync()'s current behaviour. I can only point out again that the way they've designed it does *not* lose or corrupt data: all synced data is available as soon as the filesystem journals have been flushed. We have to explicitly flush the journals ourselves, one way or another, to ensure that GRUB and other not-fully-Linux-compatible filesystem implementations work correctly.
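To make the argv[0][0] == '@' convention mentioned above concrete, here is a small sketch of how one might spot such processes from the raw contents of /proc/<pid>/cmdline. The function name `survives_final_kill` is hypothetical, and the check is deliberately incomplete: the message says the exemption applies to privileged processes, and that part is not modelled here.

```python
def survives_final_kill(cmdline):
    """Return True if argv[0] starts with '@'.

    `cmdline` is the raw bytes of /proc/<pid>/cmdline: argv joined by NUL
    bytes. Kernel threads have an empty cmdline and are reported as False.
    Sketch only: any privilege check on the process is left out.
    """
    return cmdline[:1] == b"@"


def read_cmdline(pid):
    """Read /proc/<pid>/cmdline as bytes (Linux only)."""
    with open("/proc/%d/cmdline" % pid, "rb") as f:
        return f.read()
```

Any process for which this returns True may still be writing to a filesystem after the killing spree, which is the reason given above for freezing only right before reboot(2).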
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, 10 Apr 2017, Lennart Poettering wrote: On Sun, 09.04.17 10:11, Michael Chapman (m...@very.puzzling.org) wrote: Don't forget, they've provided an interface for software to use if it needs more than the guarantees provided by sync. Informally speaking, the FIFREEZE ioctl is intended to place a filesystem into a "fully consistent" state, not just a "fully recoverable" state. (Formally it's all a bit hazy: POSIX really doesn't guarantee anything with sync.) FIFREEZE does considerably more than what you suggest: it also pauses all further changes until FITHAW is called. And that's semantics we really cannot have. If systemd is just about to call reboot(2), why does it matter? I do think we should attempt to remount readonly before doing the FIFREEZE. I thought systemd did that, but it appears that it does not. A readonly remount will do what we want so long as no remaining processes have any files opened for writing on the filesystem. The FIFREEZE would only be necessary when the remount fails. Remember, all of this is because there *is* software that does the wrong thing, and it *is* possible for software to hang and be unkillable. It would be good for systemd to do the right thing even in the presence of that kind of software. ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/systemd-devel
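The ordering suggested in this message (try a read-only remount first, fall back to FIFREEZE only when that fails) can be written down as a few lines of control flow. The sketch below is illustrative only: `flush_before_reboot` and its two callback parameters are hypothetical names, and the callbacks stand in for a real mount(2) call with MS_REMOUNT|MS_RDONLY and a real FIFREEZE ioctl.

```python
def flush_before_reboot(mountpoint, remount_ro, freeze):
    """Prefer a read-only remount; freeze only if the remount fails.

    `remount_ro` and `freeze` are callables taking a mount point and raising
    OSError on failure (assumed interfaces, injected so the logic can be
    exercised without root). Returns which path was taken.
    """
    try:
        remount_ro(mountpoint)
        return "remounted-ro"
    except OSError:
        # For example EBUSY, when a process still pins a deleted binary.
        freeze(mountpoint)
        return "frozen"
```

Injecting the two operations keeps the decision logic separate from the privileged system calls, so the fallback path can be tried out with stand-in functions.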
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, 10.04.17 16:14, Michael Chapman (m...@very.puzzling.org) wrote:
> On Mon, 10 Apr 2017, Chris Murphy wrote:
> > On Sun, Apr 9, 2017 at 5:17 AM, Lennart Poettering wrote:
> > > That said, are you sure FIFREEZE is really what we want there? It appears to also pause any further writes to disk (until FITHAW is called).
> > >
> > > So, I am still puzzled why the file system people think that "sync()" isn't supposed to actually sync things to disk...
> >
> > https://www.spinics.net/lists/linux-xfs/msg05113.html
>
> Ah good, Dave actually suggests using a freeze there. A freeze without a corresponding thaw should be OK if it's definitely after all processes have been killed, since we're just about to reboot anyway. (Obviously we'd want to avoid the whole lot when running in a container or when doing kexec.)

No, there is no such guarantee. We support initrds that run userspace stuff from the initrd at boot, that stays around in the background and is only killed after we transition back into the initrd. And we really don't control what they do, they can do anything they like, access any file they want at any time. We added this primarily to support storage services backing the root file system (think iscsid, nbd, ...), but it actually can be anything that has the "feel" of a kernel component in being around since the time before systemd initializes until after the time it shuts down again, but is actually implemented in userspace. In fact, this is precisely what plymouth is making use of: by marking a process with argv[0][0] = '@' we permit any privileged process to be excluded from the final killing spree, so that it will survive until the initrd shutdown transition.

So no, "freeze" is not an option. That sounds like a recipe to make shutdown hang. We need a sync() that actually does what is documented and syncs the file system properly.

Lennart

-- 
Lennart Poettering, Red Hat
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Sun, 09.04.17 10:11, Michael Chapman (m...@very.puzzling.org) wrote: > Don't forget, they've provided an interface for software to use if it needs > more than the guarantees provided by sync. Informally speaking, the FIFREEZE > ioctl is intended to place a filesystem into a "fully consistent" state, not > just a "fully recoverable" state. (Formally it's all a bit hazy: POSIX > really doesn't guarantee anything with sync.) FIFREEZE does considerably more than what you suggest: it also pauses all further changes until FITHAW is called. And that's semantics we really cannot have. Lennart -- Lennart Poettering, Red Hat ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Sun, 09.04.17 22:37, Chris Murphy (li...@colorremedies.com) wrote:
> Oh god - that's the opposite direction to go in. There's not even pretend crash safety with those file systems. If they're dirty, you must use an fsck to get them back to consistency. Even if the toy fs support found in firmware will tolerate the inconsistency, who knows what blocks it actually ends up loading into memory, you can just get a crash later at the bootloader, or the kernel, or initramfs. That so much consumer hardware routinely lies about having committed data to stable media following sync() makes those file systems even less reliable for this purpose. Once corrupt, the file system has no fail safe or fallback like a journaled or COW file system. It's busted until fixed with fsck.

Well, note that in a systemd world where systemd manages the ESP there's a pretty good chance the file system stays in a clean state, since we unmount it 2s after each write, and only make it available via autofs. So yeah, only in a short time frame around a boot loader update is there a chance for corruption. Which is certainly much better than a corruption on every disk change like on XFS...

Lennart

-- 
Lennart Poettering, Red Hat
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, 10 Apr 2017, Mantas Mikulėnas wrote: On Mon, Apr 10, 2017 at 9:14 AM, Michael Chapmanwrote: On Mon, 10 Apr 2017, Chris Murphy wrote: On Sun, Apr 9, 2017 at 5:17 AM, Lennart Poettering wrote: That said, are you sure FIFREEZE is really what we want there? it appears to also pause any further writes to disk (until FITHAW is called). So, I am still puzzled why the file system people think that "sync()" isn't supposed to actually sync things to disk... https://www.spinics.net/lists/linux-xfs/msg05113.html Ah good, Dave actually suggests using a freeze there. A freeze without a corresponding thaw should be OK if it's definitely after all processes have been killed, since we're just about to reboot anyway. (Obviously we'd want to avoid the whole lot when running in a container or when doing kexec.) Or, I think, when pivoting back to the shutdown-initramfs. (Though then you also need the shutdown-initramfs to run `fsfreeze`, I guess?) No, I don't think it should be done then. If a filesystem is still in use, then doing a freeze there would likely make any processes still using it unkillable. And doing a freeze followed by a thaw doesn't gain us much, we'd still need to do another freeze at the end of shutdown-initramfs. So I think we should only freeze any still-mounted filesystems *right* before the reboot(2) call. That's the only time it's guaranteed to be safe -- if there's still miraculously some other process hanging around, it's about to disappear anyway. 
On the topic of XFS filesystem freezing, I just found this slide deck: http://oss.sgi.com/projects/xfs/training/xfs_slides_09_internals.pdf Page 39 is of particular interest: """ sync(2) * XFS implements an optimization to sync(2) of metadata: - XFS will only force the log out, such that any dirty metadata that is incore is written to the log only, the metadata itself is not necessarily written - This is safe, since all change is ondisk - File data is guaranteed too (even barriers) * Log and metadata are written to disk for - freeze/thaw - remount ro - unmount Applications like grub have been bitten in the past, but fixed nowadays """ I'm not sure what it's referring to with GRUB there, but at least this confirms what the filesystem developers' intentions are with the sync(2) call.___ systemd-devel mailing list systemd-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/systemd-devel
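Freezing "any still-mounted filesystems" presupposes a way to enumerate them. As a rough sketch, the mount table in /proc/self/mounts (whitespace-separated fields: device, mount point, filesystem type, options, ...) can be filtered by type. The function name `journaled_mount_points` and the default type set are assumptions made for this example, not something taken from systemd.

```python
def journaled_mount_points(mounts_text, fstypes=("xfs", "ext3", "ext4")):
    """Return mount points whose filesystem type is in `fstypes`.

    `mounts_text` is the text of /proc/self/mounts. Mount points containing
    spaces appear there with octal escapes such as \\040; this sketch returns
    them undecoded.
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] in fstypes:
            result.append(fields[1])
    return result
```

On a live system the input would come from reading /proc/self/mounts just before the reboot(2) call, after the unmount loop has removed everything it can.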
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, Apr 10, 2017 at 9:14 AM, Michael Chapmanwrote: > On Mon, 10 Apr 2017, Chris Murphy wrote: > >> On Sun, Apr 9, 2017 at 5:17 AM, Lennart Poettering >> wrote: >> >> That said, are you sure FIFREEZE is really what we want there? it >>> appears to also pause any further writes to disk (until FITHAW is >>> called). >>> >> >> So, I am still puzzled why the file system people think that "sync()" >>> isn't supposed to actually sync things to disk... >>> >> >> https://www.spinics.net/lists/linux-xfs/msg05113.html >> > > Ah good, Dave actually suggests using a freeze there. A freeze without a > corresponding thaw should be OK if it's definitely after all processes have > been killed, since we're just about to reboot anyway. (Obviously we'd want > to avoid the whole lot when running in a container or when doing kexec.) > Or, I think, when pivoting back to the shutdown-initramfs. (Though then you also need the shutdown-initramfs to run `fsfreeze`, I guess?) -- Mantas Mikulėnas ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, 10 Apr 2017, Chris Murphy wrote: On Sun, Apr 9, 2017 at 5:17 AM, Lennart Poetteringwrote: That said, are you sure FIFREEZE is really what we want there? it appears to also pause any further writes to disk (until FITHAW is called). So, I am still puzzled why the file system people think that "sync()" isn't supposed to actually sync things to disk... https://www.spinics.net/lists/linux-xfs/msg05113.html Ah good, Dave actually suggests using a freeze there. A freeze without a corresponding thaw should be OK if it's definitely after all processes have been killed, since we're just about to reboot anyway. (Obviously we'd want to avoid the whole lot when running in a container or when doing kexec.) I'll try to reproduce the problem (I don't use Plymouth, so I haven't seen it myself yet) and come up with a patch. ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Sun, Apr 9, 2017 at 2:17 PM, Lennart Poettering wrote:
> On Sun, 09.04.17 10:11, Michael Chapman (m...@very.puzzling.org) wrote:
> > Don't forget, they've provided an interface for software to use if it needs more than the guarantees provided by sync. Informally speaking, the FIFREEZE ioctl is intended to place a filesystem into a "fully consistent" state, not just a "fully recoverable" state. (Formally it's all a bit hazy: POSIX really doesn't guarantee anything with sync.)
>
> If FIFREEZE is a generic ioctl() supported by a number of different file systems I figure I would be much more OK with calling it.
>
> That said, are you sure FIFREEZE is really what we want there? It appears to also pause any further writes to disk (until FITHAW is called). Which isn't really what we are interested in here (note that we return back to the initrd after the umount spree and it shall be able to do the rest, and if it actually can do that, then the file systems should be able to unmount and that usually results in writes to disk...)
>
> So, I am still puzzled why the file system people think that "sync()" isn't supposed to actually sync things to disk... I mean, it appears the call is pretty much useless and its traditional usage (which prominently is in sysvinit before reboot()) appears to be broken by their behaviour.
>
> Why bother with sync() at all, if it implies no guarantees? This is quite frankly bullshit...
>
> It appears to me that using /boot on a file system with such broken sync() semantics is really not a safe thing to do, and people should probably only use something more reliable, i.e. ext2 or vfat where sync() actually works correctly...

It does? My /boot is vfat due to UEFI requirements, and it becomes unbootable if you so much as sneeze near it – I've already had to repair it thrice, after a sync and everything.

-- 
Mantas Mikulėnas
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Sun, Apr 09, 2017 at 10:37:36PM -0600, Chris Murphy wrote:
> On Sun, Apr 9, 2017 at 5:17 AM, Lennart Poettering wrote:
> > That said, are you sure FIFREEZE is really what we want there? it appears to also pause any further writes to disk (until FITHAW is called).
> >
> > So, I am still puzzled why the file system people think that "sync()" isn't supposed to actually sync things to disk...
>
> https://www.spinics.net/lists/linux-xfs/msg05113.html

So the “solution” seems to be adding FIFREEZE/FITHAW ioctls after sync()?

-- 
Tomasz Torcz
xmpp: zdzich...@chrome.pl
"Never underestimate the bandwidth of a station wagon filled with backup tapes." -- Jim Gray
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Sun, Apr 9, 2017 at 5:17 AM, Lennart Poettering wrote:
> That said, are you sure FIFREEZE is really what we want there? it appears to also pause any further writes to disk (until FITHAW is called).
>
> So, I am still puzzled why the file system people think that "sync()" isn't supposed to actually sync things to disk...

https://www.spinics.net/lists/linux-xfs/msg05113.html

The question isn't directly answered in there (it is part of the thread on this very subject though). My guess is that sync() predates journaled file systems, and the expectations of sync() for a journaled file system are basically just crash consistency, not that all metadata is on disk. Fully writing all metadata is expensive, as is checking and fixing it with an offline fsck. Both of those are reasons why we have journaled filesystems. If sync() required all fs metadata to commit to stable media it would make file systems dog slow. Every damn thing is doing fsyncs now. Before Btrfs had a log tree, workloads with many fsyncs would hang the file system and the entire workload as well, so my guess is that sync() meaning all fs metadata is committed on ext4 and XFS would mean massive performance hits that no one would be happy about.

> Why bother with sync() at all, if it implies no guarantees? This is quite frankly bullshit...
>
> It appears to me that using /boot on a file system with such broken sync() semantics is really not a safe thing to do, and people should probably only use something more reliable, i.e. ext2 or vfat where sync() actually works correctly...

Oh god - that's the opposite direction to go in. There's not even pretend crash safety with those file systems. If they're dirty, you must use an fsck to get them back to consistency. Even if the toy fs support found in firmware will tolerate the inconsistency, who knows what blocks it actually ends up loading into memory, you can just get a crash later at the bootloader, or the kernel, or initramfs. That so much consumer hardware routinely lies about having committed data to stable media following sync() makes those file systems even less reliable for this purpose. Once corrupt, the file system has no fail safe or fallback like a journaled or COW file system. It's busted until fixed with fsck.

-- 
Chris Murphy
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Sat, 8 Apr 2017, Chris Murphy wrote:
> On Tue, Apr 4, 2017 at 11:55 AM, Andrei Borzenkov wrote:
> > grub2 is not limited to 640KiB. Actually it will actively avoid using low memory. It switches to protected mode as the very first thing and can use up to 4GiB (and even this probably can be lifted on 64 bit platform). The real problem is the fact that grub is read-only so every time you access file on journaled partition it will need to replay journal again from scratch. This will likely be painfully slow (I remember that grub legacy on reiser needed couple of minutes to read kernel and much more to read initrd, and that was when both were smaller than now).
>
> OK well that makes more sense; but yeah it still sounds like journal replay is a non-starter. The entire fs metadata would have to be read into memory and create something like a RAM based rw snapshot which is backed by the ro disk version as origin, and then play the log against the RAM snapshot. That could be faster than constantly replaying the journal from scratch for each file access. But still - sounds overly complicated.
>
> I think this qualifies as "Doctor, it hurts when I do this." And the doctor says, "So don't do that." And I'm referring to Plymouth exempting itself from kill while also not running from initramfs. So I'll kindly make the case with Plymouth folks to stop pressing this particular hurt me button.
>
> But hey, pretty cool bug. Not often is it the case you find such an old bug so easily reproducible but near as I can tell only one person was hitting it until I tried to reproduce it.

I too was hit by this bug on one of my systems. What I did was simply remove all Plymouth RPMs, and everything was good from that moment on.

Holger
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Sun, 09.04.17 10:11, Michael Chapman (m...@very.puzzling.org) wrote:
> Don't forget, they've provided an interface for software to use if it needs more than the guarantees provided by sync. Informally speaking, the FIFREEZE ioctl is intended to place a filesystem into a "fully consistent" state, not just a "fully recoverable" state. (Formally it's all a bit hazy: POSIX really doesn't guarantee anything with sync.)

If FIFREEZE is a generic ioctl() supported by a number of different file systems I figure I would be much more OK with calling it.

That said, are you sure FIFREEZE is really what we want there? It appears to also pause any further writes to disk (until FITHAW is called). Which isn't really what we are interested in here (note that we return back to the initrd after the umount spree and it shall be able to do the rest, and if it actually can do that, then the file systems should be able to unmount and that usually results in writes to disk...)

So, I am still puzzled why the file system people think that "sync()" isn't supposed to actually sync things to disk... I mean, it appears the call is pretty much useless and its traditional usage (which prominently is in sysvinit before reboot()) appears to be broken by their behaviour.

Why bother with sync() at all, if it implies no guarantees? This is quite frankly bullshit...

It appears to me that using /boot on a file system with such broken sync() semantics is really not a safe thing to do, and people should probably only use something more reliable, i.e. ext2 or vfat where sync() actually works correctly...

Lennart

-- 
Lennart Poettering, Red Hat
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Sun, 9 Apr 2017, Chris Murphy wrote:
> On Tue, Apr 4, 2017 at 11:55 AM, Andrei Borzenkov wrote:
>> 03.04.2017 07:56, Chris Murphy wrote:
>>> On Thu, Mar 30, 2017 at 6:07 AM, Michael Chapman wrote:
>>>> I am not a filesystem developer (IANAFD?), but I'm pretty sure they're going to say "the metadata _is_ synced, it's in the journal". And it's hard to argue that. After all, the filesystem will be perfectly valid the next time it is mounted, after the journal has been replayed, and it will contain all data written prior to the sync call. It did exactly what the manpage says it does.
>>>
>>> That's their position.
>>>
>>> Also, the same file system dirtiness and journal replay is needed on ext4. The sample size is too small to say categorically that the same problem can't happen on ext4 in the same situation. Maybe the grub.cfg is readable, but maybe the kernel isn't, or the initramfs, or something else.
>>
>> Yes, I have seen the same on ext4 which prompted me to play with journal replay code. Unfortunately I do not know how to reliably trigger this condition.
>
> I can reliably trigger a dirty ext4 or XFS file system 100% of the time with all recent Fedora installations when doing an offline update. What's very non-deterministic is how this dirtiness will manifest. Filesystems folks basically live in an alternate reality where the farther in time a file system is from mkfs time, the more non-deterministic the file system behaves. *shrug*

They don't expect their filesystems to be used except through their own filesystem code. It is perfectly deterministic behaviour when their filesystem code is used.

Their logic seems _very_ reasonable to me. Don't forget, they've provided an interface for software to use if it needs more than the guarantees provided by sync. Informally speaking, the FIFREEZE ioctl is intended to place a filesystem into a "fully consistent" state, not just a "fully recoverable" state. (Formally it's all a bit hazy: POSIX really doesn't guarantee anything with sync.)

Currently systemd calls sync at shutdown. It doesn't need to do that; it could have just assumed all other software is written correctly. It calls sync as a courtesy to that other software. I really do think systemd ought to freeze the filesystem at the same time, for _exactly the same reasons_. That will solve this Plymouth problem, but it will also solve the problem for any other software that somebody might run (possibly accidentally, possibly not) during late shutdown.

This problem doesn't just affect GRUB; it could affect users of other operating systems too. I was speaking to somebody who runs OpenBSD. Apparently OpenBSD doesn't have an Ext3 driver, only an Ext2 one, so it is somewhat common practice to use an Ext3 filesystem on Linux but mount it as Ext2 on OpenBSD. That can only work correctly if the filesystem's journal is completely flushed. systemd is the only thing that can do this reliably, since it's the only thing running just before the reboot call.
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Tue, Apr 4, 2017 at 11:55 AM, Andrei Borzenkov wrote:
> 03.04.2017 07:56, Chris Murphy wrote:
>> On Thu, Mar 30, 2017 at 6:07 AM, Michael Chapman wrote:
>>
>>> I am not a filesystem developer (IANAFD?), but I'm pretty sure they're going to say "the metadata _is_ synced, it's in the journal". And it's hard to argue that. After all, the filesystem will be perfectly valid the next time it is mounted, after the journal has been replayed, and it will contain all data written prior to the sync call. It did exactly what the manpage says it does.
>>
>> That's their position.
>>
>> Also, the same file system dirtiness and journal replay is needed on ext4. The sample size is too small to say categorically that the same problem can't happen on ext4 in the same situation. Maybe the grub.cfg is readable, but maybe the kernel isn't, or the initramfs, or something else.
>
> Yes, I have seen the same on ext4 which prompted me to play with journal replay code. Unfortunately I do not know how to reliably trigger this condition.

I can reliably trigger a dirty ext4 or XFS file system 100% of the time with all recent Fedora installations when doing an offline update. What's very non-deterministic is how this dirtiness will manifest. Filesystems folks basically live in an alternate reality where the farther in time a file system is from mkfs time, the more non-deterministic the file system behaves. *shrug*

>>> The problem here seems to be that GRUB is an incomplete XFS implementation, one which doesn't know about XFS journalling. It may be a good argument XFS shouldn't be used for /boot... but the issue can really arise with just about any other journalled filesystems, like Ext3/4.
>>
>> I wondered about it at the start, and asked about it on the XFS list in the first post about the problem. The developers nearly died laughing at the idea of doing journal replay in 640KiB of memory. They said categorically it's not possible.
>
> grub2 is not limited to 640KiB. Actually it will actively avoid using low memory. It switches to protected mode as the very first thing and can use up to 4GiB (and even this probably can be lifted on 64 bit platform). The real problem is the fact that grub is read-only so every time you access a file on a journaled partition it will need to replay the journal again from scratch. This will likely be painfully slow (I remember that grub legacy on reiser needed a couple of minutes to read the kernel and much more to read the initrd, and that was when both were smaller than now).

OK well that makes more sense; but yeah, it still sounds like journal replay is a non-starter. The entire fs metadata would have to be read into memory to create something like a RAM-based rw snapshot, backed by the ro disk version as origin, and then the log would be played against the RAM snapshot. That could be faster than constantly replaying the journal from scratch for each file access. But still, it sounds overly complicated.

I think this qualifies as "Doctor, it hurts when I do this." And the doctor says, "So don't do that." And I'm referring to Plymouth exempting itself from kill while also not running from initramfs. So I'll kindly make the case with the Plymouth folks to stop pressing this particular hurt-me button.

But hey, pretty cool bug. It's not often you find such an old bug that is so easily reproducible, yet as near as I can tell only one person was hitting it until I tried to reproduce it.

--
Chris Murphy
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Thu, Mar 30, 2017 at 6:07 AM, Michael Chapman wrote:
> I am not a filesystem developer (IANAFD?), but I'm pretty sure they're going to say "the metadata _is_ synced, it's in the journal". And it's hard to argue that. After all, the filesystem will be perfectly valid the next time it is mounted, after the journal has been replayed, and it will contain all data written prior to the sync call. It did exactly what the manpage says it does.

That's their position.

Also, the same file system dirtiness and journal replay is needed on ext4. The sample size is too small to say categorically that the same problem can't happen on ext4 in the same situation. Maybe the grub.cfg is readable, but maybe the kernel isn't, or the initramfs, or something else.

> The problem here seems to be that GRUB is an incomplete XFS implementation, one which doesn't know about XFS journalling. It may be a good argument XFS shouldn't be used for /boot... but the issue can really arise with just about any other journalled filesystems, like Ext3/4.

I wondered about it at the start, and asked about it on the XFS list in the first post about the problem. The developers nearly died laughing at the idea of doing journal replay in 640KiB of memory. They said categorically it's not possible.

> As Mantas Mikulėnas points out, the FIFREEZE ioctl is supported wherever systemd is, and it's not just XFS-specific. I think it'd be smartest just to use it because it's there, it's cheap, and it can't make things worse.

I think getting mount/umount exit codes reported in the journal when systemd.log_level=debug should be a higher priority. We really ought to find out exactly what's going on, so we don't have to speculate, and I think it's handy to have anyway if it's not a PITA to implement.

I've retested after removing plymouth, and the problem isn't reproducible; I've nominated the plymouth non-killable behavior as a Fedora 26 blocker. So it should get fixed and upstreamed.

--
Chris Murphy
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Thu, 30 Mar 2017, Lennart Poettering wrote:
[...]
> I am sorry, but XFS is really broken here. All init systems since time began kinda did the same thing when shutting down:
>
> a) try to unmount all fs that can be unmounted
> b) for the remaining ones, try to remount ro (the root fs usually qualifies)
> c) sync()
> d) reboot()
>
> That's how sysvinit does it, how Upstart does it, and systemd does it the same way. (Well, if the initrd supports it we go one step further though, and optionally pivot back to the initrd which can then unmount the root file system, too. That's a systemd innovation however, and only supported on initrd systems where the initrd supports it)
>
> If the XFS devs think that the sync() before reboot() can be partially ignored, then I am sorry for them, but that makes XFS pretty much incompatible with every init system in existence.
>
> Or to say this differently: if they expect us to invoke some magic per-filesystem ioctl() before reboot(), then that's nonsense. No init system calls that, and I am strongly against such hacks. They should just fix their APIs.
>
> Moreover, the man page of sync() is pretty clear on this:
>
> "sync() causes all pending modifications to file system metadata and cached file data to be written to the underlying filesystems."
>
> It explicitly mentions metadata. Any way you turn it, the XFS folks are just confused if they really claim sync() doesn't have to sync metadata. History says differently, and so does the man page documentation.

I am not a filesystem developer (IANAFD?), but I'm pretty sure they're going to say "the metadata _is_ synced, it's in the journal". And it's hard to argue that. After all, the filesystem will be perfectly valid the next time it is mounted, after the journal has been replayed, and it will contain all data written prior to the sync call. It did exactly what the manpage says it does.

The problem here seems to be that GRUB is an incomplete XFS implementation, one which doesn't know about XFS journalling. It may be a good argument XFS shouldn't be used for /boot... but the issue can really arise with just about any other journalled filesystems, like Ext3/4.

As Mantas Mikulėnas points out, the FIFREEZE ioctl is supported wherever systemd is, and it's not just XFS-specific. I think it'd be smartest just to use it because it's there, it's cheap, and it can't make things worse.

--
Michael
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Wed, 22.03.17 11:05, Chris Murphy (li...@colorremedies.com) wrote:
> > Result code of "remount ro" is not evaluated or logged. systemd does
> >
> > (void) mount(NULL, m->path, NULL, MS_REMOUNT|MS_RDONLY, options);
> >
> > where "options" are those from /proc/self/mountinfo sans ro|rw.
> >
> > Probably it should log it at least with debug level.
>
> So I've asked over on the XFS list about this, and they suggest all of this is expected behavior under the circumstances. The sync only means data is committed to disk with an appropriate journal entry, it doesn't mean fs metadata is up to date, and it's the fs metadata that GRUB is depending on, but isn't up to date yet. So the suggestion is that if remount-ro fails, to use freeze/unfreeze and then reboot. The [...]

I am sorry, but XFS is really broken here. All init systems since time began kinda did the same thing when shutting down:

a) try to unmount all fs that can be unmounted
b) for the remaining ones, try to remount ro (the root fs usually qualifies)
c) sync()
d) reboot()

That's how sysvinit does it, how Upstart does it, and systemd does it the same way. (Well, if the initrd supports it we go one step further though, and optionally pivot back to the initrd which can then unmount the root file system, too. That's a systemd innovation however, and only supported on initrd systems where the initrd supports it)

If the XFS devs think that the sync() before reboot() can be partially ignored, then I am sorry for them, but that makes XFS pretty much incompatible with every init system in existence.

Or to say this differently: if they expect us to invoke some magic per-filesystem ioctl() before reboot(), then that's nonsense. No init system calls that, and I am strongly against such hacks. They should just fix their APIs.

Moreover, the man page of sync() is pretty clear on this:

"sync() causes all pending modifications to file system metadata and cached file data to be written to the underlying filesystems."

It explicitly mentions metadata. Any way you turn it, the XFS folks are just confused if they really claim sync() doesn't have to sync metadata. History says differently, and so does the man page documentation.

> If it's useful I'll file an issue with systemd on github to get a freeze/unfreeze inserted. remount-ro isn't always successful, and clearly it's not ok to reboot anyway if remount-ro fails.

I don't think we'd merge such a patch. The XFS folks should implement documented behaviour and that'll not just fix things with systemd, but with any init system.

Lennart

--
Lennart Poettering, Red Hat
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, 27.03.17 22:27, Mantas Mikulėnas (graw...@gmail.com) wrote:
> On Mon, Mar 27, 2017 at 10:20 PM, Chris Murphy wrote:
>
> > Ok so the dirty file system problem always happens with all pk offline updates on Fedora using either ext4 or XFS with any layout; and it's easy to reproduce.
> >
> > 1. Clean install any version of Fedora, defaults.
> > 2. Once Gnome Software gives notification of updates, Restart & Install
> > 3. System reboots, updates are applied, system reboots again.
> > 4. Now check the journal filtering for 'fsck' and you'll see it replayed the journals; if using XFS check the filter for "XFS" and you'll see the kernel did journal replay at mount time.
> >
> > Basically systemd is rebooting even though the remount-ro fails multiple times, due to plymouth running off root fs and being exempt from being killed, and then reboots anyway, leaving the file system dirty. So it seems like a flaw to me to allow an indefinite exemption from killing a process that's holding a volume rw, preventing remount-ro before reboot.
> >
> > So there's a risk that in other configurations this makes either ext4 or XFS systems unbootable following an offline update.
>
> So on the one hand it's probably a Plymouth bug or misconfiguration (it shouldn't mark itself exempt unless it runs off an in-memory initramfs).

Correct. Plymouth shouldn't mark itself this way, unless it runs from the initrd. The documentation says this very explicitly:

  Again: if your code is being run from the root file system, then this logic suggested above is NOT for you. Sorry. Talk to us, we can probably help you to find a different solution to your problem.

See https://www.freedesktop.org/wiki/Software/systemd/RootStorageDaemons/.

That said, a file system remaining mounted during shutdown is ugly but shouldn't result in data loss, as we do call sync() before reboot(), and so does any other init system (see other mail).

Hence, there are two bugs here:

a) an ugliness in plymouth (or the way it is used by fedora's package update logic), resulting in something that is mostly a cosmetic problem

b) XFS is simply broken: if we call sync() it should sync metadata. This bug happens to be triggered by a).

Lennart

--
Lennart Poettering, Red Hat
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Thu, Mar 30, 2017 at 1:24 PM, Lennart Poettering wrote:
> Or to say this differently: if they expect us to invoke some magic per-filesystem ioctl() before reboot(), then that's nonsense. No init system calls that, and I am strongly against such hacks. They should just fix their APIs.

On the other hand, no other init system generally supports exclusions from the killing spree...

As for freezing, that feature seems to have been made generic in 2.6.28 (FIFREEZE/FITHAW), although I couldn't find much documentation on it. It looks mainly meant for snapshots and backups, not for regular reboots.

--
Mantas Mikulėnas
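For readers unfamiliar with the generic freeze interface mentioned here, the following is a minimal Python sketch of calling FIFREEZE and FITHAW on a mounted filesystem. It is an illustration only, not what systemd does. The ioctl numbers are computed with the generic Linux `_IOWR` encoding (an assumption that holds on x86 and ARM but not on every architecture), and the helper names `_iowr` and `frozen` are made up for this example.

```python
import contextlib
import fcntl
import os


def _iowr(type_char, nr, size):
    """Encode an _IOWR ioctl number using the generic Linux bit layout."""
    return (3 << 30) | (size << 16) | (ord(type_char) << 8) | nr


# <linux/fs.h> defines these as _IOWR('X', 119, int) and _IOWR('X', 120, int).
FIFREEZE = _iowr("X", 119, 4)
FITHAW = _iowr("X", 120, 4)


@contextlib.contextmanager
def frozen(mount_point):
    """Freeze the filesystem mounted at mount_point, thaw it on exit.

    Needs CAP_SYS_ADMIN. While frozen, writers to that filesystem block.
    """
    fd = os.open(mount_point, os.O_RDONLY | os.O_DIRECTORY)
    try:
        fcntl.ioctl(fd, FIFREEZE, 0)
        try:
            yield
        finally:
            # Always thaw, otherwise the filesystem stays blocked for writes.
            fcntl.ioctl(fd, FITHAW, 0)
    finally:
        os.close(fd)
```

A caller would write something like `with frozen("/boot"): pass` to force a freeze/thaw cycle; anything that writes to the filesystem inside the `with` body would hang until the thaw.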
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Tue, 28.03.17 11:31, Chris Murphy (li...@colorremedies.com) wrote:
> OK but it's obviously possible for a developer to run a process from root fs, and mark it kill exempt. That's the problem under discussion, the developer is doing the wrong thing, and it's allowed. And it's been going on for a very long time (at least 5 releases of Fedora)

We expect that people who use this functionality are careful with it, and we made sure to document this all very explicitly:

https://www.freedesktop.org/wiki/Software/systemd/RootStorageDaemons/

We even say very clearly what the correct way is to detect whether we are running from the initrd or the host system.

But anyway, I'd claim that the main culprit is XFS here.

Lennart

--
Lennart Poettering, Red Hat
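As background for the exemption being discussed: per the RootStorageDaemons page linked above, a process marks itself exempt by making the first character of argv[0] an '@', and a process can tell it is in the initrd by the presence of /etc/initrd-release. Below is a rough Python sketch of checking both from the outside. The function names are hypothetical and this is not systemd's own code.

```python
import os


def marked_kill_exempt(cmdline):
    """True if the raw /proc/<pid>/cmdline bytes start with '@'."""
    return cmdline[:1] == b"@"


def running_from_initrd(root="/"):
    """True if <root>/etc/initrd-release exists, the documented initrd marker."""
    return os.path.exists(os.path.join(root, "etc/initrd-release"))


def exempt_processes(proc="/proc"):
    """List (pid, argv0) for every process that marked itself kill-exempt."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "cmdline"), "rb") as f:
                cmdline = f.read()
        except OSError:
            continue  # process exited or is not readable
        if marked_kill_exempt(cmdline):
            argv0 = cmdline.split(b"\0")[0].decode(errors="replace")
            found.append((int(entry), argv0))
    return found
```

On a host system (where `running_from_initrd()` is false), any entry returned by `exempt_processes()` is a candidate for the situation described in this thread: a process exempt from the final kill while still running off the root filesystem.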
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Tue, Mar 28, 2017 at 8:31 PM, Chris Murphy wrote:
> On Tue, Mar 28, 2017 at 10:41 AM, Mantas Mikulėnas wrote:
> > So the same applies to plymouth, IMO -- it should only mark itself exempt if it runs from the initramfs and knows that it won't interfere.
>
> How is this exemption specified? Would it be part of the plymouth packaging?

https://cgit.freedesktop.org/plymouth/commit/?id=9e5a276f322cfce46b5b2ed2125cb9ec67df7e9f

--
Mantas Mikulėnas
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Tue, Mar 28, 2017 at 10:41 AM, Mantas Mikulėnas wrote:
> On Tue, Mar 28, 2017 at 5:01 PM, Chris Murphy wrote:
>>
>> On Mon, Mar 27, 2017 at 1:27 PM, Mantas Mikulėnas wrote:
>> > On Mon, Mar 27, 2017 at 10:20 PM, Chris Murphy wrote:
>> >>
>> >> Ok so the dirty file system problem always happens with all pk offline updates on Fedora using either ext4 or XFS with any layout; and it's easy to reproduce.
>> >>
>> >> 1. Clean install any version of Fedora, defaults.
>> >> 2. Once Gnome Software gives notification of updates, Restart & Install
>> >> 3. System reboots, updates are applied, system reboots again.
>> >> 4. Now check the journal filtering for 'fsck' and you'll see it replayed the journals; if using XFS check the filter for "XFS" and you'll see the kernel did journal replay at mount time.
>> >>
>> >> Basically systemd is rebooting even though the remount-ro fails multiple times, due to plymouth running off root fs and being exempt from being killed, and then reboots anyway, leaving the file system dirty. So it seems like a flaw to me to allow an indefinite exemption from killing a process that's holding a volume rw, preventing remount-ro before reboot.
>> >>
>> >> So there's a risk that in other configurations this makes either ext4 or XFS systems unbootable following an offline update.
>> >
>> > So on the one hand it's probably a Plymouth bug or misconfiguration (it shouldn't mark itself exempt unless it runs off an in-memory initramfs).
>>
>> OK. But does it even make sense to have a process exempt from killing, when it's going to get killed by reboot? Seems to me once we're at remount-ro or umount time, nothing is exempt, they need to be forcibly killed, clean up the file system, and then reboot.
>
> Processes are killed *before* the remount/unmount stage. The primary users of kill-exemption would therefore be daemons which *provide* access to the root filesystem, e.g. iscsid, rpc helper daemons, or even ntfs-3g. (Naturally these are expected to be running from the initramfs.)

OK but it's obviously possible for a developer to run a process from root fs, and mark it kill exempt. That's the problem under discussion: the developer is doing the wrong thing, and it's allowed. And it's been going on for a very long time (at least 5 releases of Fedora).

> So the same applies to plymouth, IMO -- it should only mark itself exempt if it runs from the initramfs and knows that it won't interfere.

How is this exemption specified? Would it be part of the plymouth packaging?

I recognize the immediate bug is with plymouth, so to move this forward I'm happy to assign the Fedora bug and leave it up to Fedora devs to figure out whether plymouth should go in the initramfs, or whether to remove the kill exemption. But long term, I still think there's a role for systemd here: to disregard a process's kill exemption if it's running from a volume that's about to be remounted-ro or umounted. Allowing such a process to block that is asking for big problems, as seen here.

--
Chris Murphy
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Tue, Mar 28, 2017 at 5:01 PM, Chris Murphy wrote:
> On Mon, Mar 27, 2017 at 1:27 PM, Mantas Mikulėnas wrote:
> > On Mon, Mar 27, 2017 at 10:20 PM, Chris Murphy wrote:
> >>
> >> Ok so the dirty file system problem always happens with all pk offline updates on Fedora using either ext4 or XFS with any layout; and it's easy to reproduce.
> >>
> >> 1. Clean install any version of Fedora, defaults.
> >> 2. Once Gnome Software gives notification of updates, Restart & Install
> >> 3. System reboots, updates are applied, system reboots again.
> >> 4. Now check the journal filtering for 'fsck' and you'll see it replayed the journals; if using XFS check the filter for "XFS" and you'll see the kernel did journal replay at mount time.
> >>
> >> Basically systemd is rebooting even though the remount-ro fails multiple times, due to plymouth running off root fs and being exempt from being killed, and then reboots anyway, leaving the file system dirty. So it seems like a flaw to me to allow an indefinite exemption from killing a process that's holding a volume rw, preventing remount-ro before reboot.
> >>
> >> So there's a risk that in other configurations this makes either ext4 or XFS systems unbootable following an offline update.
> >
> > So on the one hand it's probably a Plymouth bug or misconfiguration (it shouldn't mark itself exempt unless it runs off an in-memory initramfs).
>
> OK. But does it even make sense to have a process exempt from killing, when it's going to get killed by reboot? Seems to me once we're at remount-ro or umount time, nothing is exempt, they need to be forcibly killed, clean up the file system, and then reboot.

Processes are killed *before* the remount/unmount stage. The primary users of kill-exemption would therefore be daemons which *provide* access to the root filesystem, e.g. iscsid, rpc helper daemons, or even ntfs-3g. (Naturally these are expected to be running from the initramfs.)

So the same applies to plymouth, IMO -- it should only mark itself exempt if it runs from the initramfs and knows that it won't interfere.

(Unrelated, but I should also mention that systemd-shutdown has a "shutdown initramfs" feature, where it can jump *back* to the initramfs and let its "/shutdown" script do additional cleanup steps.)

--
Mantas Mikulėnas
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, Mar 27, 2017 at 1:27 PM, Mantas Mikulėnas wrote:
> On Mon, Mar 27, 2017 at 10:20 PM, Chris Murphy wrote:
>>
>> Ok so the dirty file system problem always happens with all pk offline updates on Fedora using either ext4 or XFS with any layout; and it's easy to reproduce.
>>
>> 1. Clean install any version of Fedora, defaults.
>> 2. Once Gnome Software gives notification of updates, Restart & Install
>> 3. System reboots, updates are applied, system reboots again.
>> 4. Now check the journal filtering for 'fsck' and you'll see it replayed the journals; if using XFS check the filter for "XFS" and you'll see the kernel did journal replay at mount time.
>>
>> Basically systemd is rebooting even though the remount-ro fails multiple times, due to plymouth running off root fs and being exempt from being killed, and then reboots anyway, leaving the file system dirty. So it seems like a flaw to me to allow an indefinite exemption from killing a process that's holding a volume rw, preventing remount-ro before reboot.
>>
>> So there's a risk that in other configurations this makes either ext4 or XFS systems unbootable following an offline update.
>
> So on the one hand it's probably a Plymouth bug or misconfiguration (it shouldn't mark itself exempt unless it runs off an in-memory initramfs).

OK. But does it even make sense to have a process exempt from killing, when it's going to get killed by reboot? Seems to me once we're at remount-ro or umount time, nothing is exempt: such processes need to be forcibly killed, the file system cleaned up, and then the system rebooted.

> But on the other hand, are filesystems really so fragile? Even though it's after a system upgrade (which updated many files), I was sure systemd at least tries to *sync* all remaining filesystems before reboot, doesn't it?

All sync does is flush data and the log to disk, not file system metadata. While this is crash safe, rebooting without either a remount-ro or a umount of the root fs is basically a crash as far as the file system is concerned. So it has to do log recovery at next mount, which the bootloader can't do. The bootloader depends on file system metadata being correct.

--
Chris Murphy
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
Ok so the dirty file system problem always happens with all pk offline updates on Fedora using either ext4 or XFS with any layout; and it's easy to reproduce.

1. Clean install any version of Fedora, defaults.
2. Once Gnome Software gives notification of updates, Restart & Install
3. System reboots, updates are applied, system reboots again.
4. Now check the journal filtering for 'fsck' and you'll see it replayed the journals; if using XFS check the filter for "XFS" and you'll see the kernel did journal replay at mount time.

Basically systemd reboots even though the remount-ro fails multiple times, because plymouth is running off the root fs and is exempt from being killed; this leaves the file system dirty. So it seems like a flaw to me to allow an indefinite exemption from killing a process that's holding a volume rw, preventing remount-ro before reboot.

So there's a risk that in other configurations this makes either ext4 or XFS systems unbootable following an offline update.

Chris Murphy
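Step 4 above amounts to searching the boot log for signs of journal replay. A small Python sketch of that filter follows. The message patterns are assumptions about the wording e2fsck and the kernel typically use ("recovering journal", "Starting recovery", "recovery complete"); they may differ between versions, and `find_journal_replays` is a name invented for this example.

```python
import re

# Assumed typical wording; adjust to what your e2fsck and kernel actually log.
REPLAY_PATTERNS = [
    re.compile(r"recovering journal"),                  # e2fsck replaying an ext3/4 journal
    re.compile(r"EXT4-fs \(.+?\): recovery complete"),  # kernel-side ext4 replay
    re.compile(r"XFS \(.+?\): Starting recovery"),      # kernel-side XFS log recovery
]


def find_journal_replays(lines):
    """Return the log lines that look like a journal replay happened."""
    return [line for line in lines
            if any(p.search(line) for p in REPLAY_PATTERNS)]
```

One way to use it is to feed it the current boot's log, for example the lines of `journalctl -b` output; a non-empty result after an offline update suggests the previous shutdown left the file system dirty.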
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Mon, Mar 27, 2017 at 10:20 PM, Chris Murphy wrote:
> Ok so the dirty file system problem always happens with all pk offline updates on Fedora using either ext4 or XFS with any layout; and it's easy to reproduce.
>
> 1. Clean install any version of Fedora, defaults.
> 2. Once Gnome Software gives notification of updates, Restart & Install
> 3. System reboots, updates are applied, system reboots again.
> 4. Now check the journal filtering for 'fsck' and you'll see it replayed the journals; if using XFS check the filter for "XFS" and you'll see the kernel did journal replay at mount time.
>
> Basically systemd is rebooting even though the remount-ro fails multiple times, due to plymouth running off root fs and being exempt from being killed, and then reboots anyway, leaving the file system dirty. So it seems like a flaw to me to allow an indefinite exemption from killing a process that's holding a volume rw, preventing remount-ro before reboot.
>
> So there's a risk that in other configurations this makes either ext4 or XFS systems unbootable following an offline update.

So on the one hand it's probably a Plymouth bug or misconfiguration (it shouldn't mark itself exempt unless it runs off an in-memory initramfs).

But on the other hand, are filesystems really so fragile? Even though it's after a system upgrade (which updated many files), I was sure systemd at least tries to *sync* all remaining filesystems before reboot, doesn't it?

--
Mantas Mikulėnas
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Tue, Mar 21, 2017 at 9:48 PM, Andrei Borzenkov wrote:
> 22.03.2017 00:10, Chris Murphy wrote:
>> OK so I had the idea to uninstall plymouth, since that's ostensibly what's holding up the remount read-only. But it's not true.
>>
>> Sending SIGTERM to remaining processes...
>> Sending SIGKILL to remaining processes...
>> Unmounting file systems.
>> Remounting '/tmp' read-only with options 'seclabel'.
>> Unmounting /tmp.
>> Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
>> Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
>> Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
>> All filesystems unmounted.
>
> Could you show your /proc/self/mountinfo before starting shutdown (or ideally just before systemd goes into umount all)? This suggests that "/" appears there three times.

I'm too stupid to figure out how to get virsh console to attach to tty9/early debug shell but here's a screen shot right as pk-offline-update is done, maybe 2 seconds before the remounting and reboot.

https://drive.google.com/open?id=0B_2Asp8DGjJ9NXRGTTFjSlVPSU0

> Result code of "remount ro" is not evaluated or logged. systemd does
>
> (void) mount(NULL, m->path, NULL, MS_REMOUNT|MS_RDONLY, options);
>
> where "options" are those from /proc/self/mountinfo sans ro|rw.
>
> Probably it should log it at least with debug level.

So I've asked over on the XFS list about this, and they suggest all of this is expected behavior under the circumstances. The sync only means data is committed to disk with an appropriate journal entry; it doesn't mean fs metadata is up to date. And it's the fs metadata that GRUB depends on, which isn't up to date yet. So the suggestion is that if remount-ro fails, to use freeze/unfreeze and then reboot. The difference between freeze/unfreeze and remount-ro is that freeze/unfreeze will update fs metadata even if there's something preventing remount-ro.

If it's useful I'll file an issue with systemd on github to get a freeze/unfreeze inserted. remount-ro isn't always successful, and clearly it's not ok to reboot anyway if remount-ro fails.

--
Chris Murphy
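The ordering suggested above (try remount-ro, and fall back to freeze/unfreeze for whatever could not be remounted) can be sketched as follows. This is only an illustration of the proposal, not systemd's behaviour; the callables `remount_ro`, `freeze`, `thaw` and `sync` are hypothetical stand-ins that the caller supplies, so the sketch can be exercised without touching a real filesystem.

```python
def final_shutdown_steps(mount_points, remount_ro, freeze, thaw, sync):
    """Run the late-shutdown steps in order and return a log of what was done.

    remount_ro(mp) must return True on success; freeze/thaw/sync return nothing.
    The caller is expected to call reboot() afterwards.
    """
    steps = []
    failed = []
    for mp in mount_points:
        if remount_ro(mp):
            steps.append(f"remount-ro {mp}: ok")
        else:
            steps.append(f"remount-ro {mp}: failed")
            failed.append(mp)

    sync()  # the traditional step c) before reboot
    steps.append("sync")

    # Proposed fallback: a freeze/thaw cycle for anything still mounted rw.
    for mp in failed:
        freeze(mp)
        thaw(mp)
        steps.append(f"freeze+thaw {mp}")
    return steps
```

Note that anything still writing after the thaw can dirty the log again, which is one reason the thread also discusses not leaving such processes alive in the first place.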
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
22.03.2017 00:10, Chris Murphy wrote:
> OK so I had the idea to uninstall plymouth, since that's ostensibly what's holding up the remount read-only. But it's not true.
>
> Sending SIGTERM to remaining processes...
> Sending SIGKILL to remaining processes...
> Unmounting file systems.
> Remounting '/tmp' read-only with options 'seclabel'.
> Unmounting /tmp.
> Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
> Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
> Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
> All filesystems unmounted.

Could you show your /proc/self/mountinfo before starting shutdown (or ideally just before systemd goes into umount all)? This suggests that "/" appears there three times.

Result code of "remount ro" is not evaluated or logged. systemd does

(void) mount(NULL, m->path, NULL, MS_REMOUNT|MS_RDONLY, options);

where "options" are those from /proc/self/mountinfo sans ro|rw.

Probably it should log it at least with debug level.
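To make the suggestion above concrete, here is a hedged Python sketch of the same remount step with its result logged instead of discarded. It is not systemd code. It assumes the "options" are the per-superblock options in the last mountinfo field (which matches the 'seclabel,attr2,inode64,noquota' shown in the log), it ignores escaped characters such as \040 in paths, and the function names are invented for the example.

```python
import ctypes
import logging
import os

MS_RDONLY = 0x01   # value from <sys/mount.h>
MS_REMOUNT = 0x20  # value from <sys/mount.h>

log = logging.getLogger("shutdown-sketch")


def parse_mountinfo_line(line):
    """Return (mount_point, fstype, super_options) for one mountinfo line."""
    left, right = line.rstrip("\n").split(" - ", 1)
    mount_point = left.split(" ")[4]
    fstype, _source, super_options = right.split(" ", 2)
    return mount_point, fstype, super_options


def strip_ro_rw(options):
    """Drop the ro/rw flag, keeping every other option in order."""
    return ",".join(o for o in options.split(",") if o not in ("ro", "rw"))


def remount_read_only(path, options):
    """Try MS_REMOUNT|MS_RDONLY and log the outcome; True on success."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_char_p]
    libc.mount.restype = ctypes.c_int
    rc = libc.mount(None, path.encode(), None,
                    MS_REMOUNT | MS_RDONLY, options.encode())
    if rc != 0:
        err = ctypes.get_errno()
        log.debug("Remounting %r read-only failed: %s", path, os.strerror(err))
        return False
    log.debug("Remounted %r read-only.", path)
    return True
```

With something like this, a failure such as EBUSY (a process still holding files open for writing) would show up in the debug log rather than the three silent retries seen above.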
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
OK so I had the idea to uninstall plymouth, since that's ostensibly what's holding up the remount read-only. But it's not true.

Sending SIGTERM to remaining processes...
Sending SIGKILL to remaining processes...
Unmounting file systems.
Remounting '/tmp' read-only with options 'seclabel'.
Unmounting /tmp.
Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
All filesystems unmounted.
Deactivating swaps.
All swaps deactivated.
Detaching loop devices.
device-enumerator: scan all dirs
device-enumerator: scanning /sys/bus
device-enumerator: scanning /sys/class
All loop devices detached.
Detaching DM devices.
device-enumerator: scan all dirs
device-enumerator: scanning /sys/bus
device-enumerator: scanning /sys/class
All DM devices detached.
Spawned /usr/lib/systemd/system-shutdown/mdadm.shutdown as 7058.
/usr/lib/systemd/system-shutdown/mdadm.shutdown succeeded.
system-shutdown succeeded.
Failed to read reboot parameter file: No such file or directory
Rebooting.
[ 47.288419] Unregister pv shared memory for cpu 0
[ 47.289140] Unregister pv shared memory for cpu 1
[ 47.290013] sd 1:0:0:0: [sda] Synchronizing SCSI cache
[ 47.315486] reboot: Restarting system
[ 47.316036] reboot: machine restart

There are still three attempts to remount read-only. Why?

Separately, checking the file system following this reboot shows the fs is clean, not dirty. So one of those remounts must have worked this time. And the file system is bootable.

There really isn't enough debugging within systemd to isolate everything that's going on here.

Chris Murphy
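One way to test the guess made in Andrei's reply (shown above in this archive), that "/" is listed three times in /proc/self/mountinfo, is to count the entries whose mount point field matches. The sketch below is only an illustration. Both function names are hypothetical, and it relies on the mount point being the fifth whitespace-separated field of each mountinfo line.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if the mount point (field 5) of one mountinfo line equals
 * `mount_point`, 0 otherwise. */
int line_has_mount_point(const char *line, const char *mount_point)
{
    char mp[4096];

    /* Fields: mount-ID parent-ID major:minor root mount-point options ... */
    if (sscanf(line, "%*s %*s %*s %*s %4095s", mp) != 1)
        return 0;
    return strcmp(mp, mount_point) == 0;
}

/* Count matching entries in a mountinfo-format file, or return -1 if the
 * file cannot be opened. Assumes each line fits in the buffer. */
int count_mounts_at(const char *mountinfo_path, const char *mount_point)
{
    FILE *f = fopen(mountinfo_path, "r");
    if (!f)
        return -1;

    char line[8192];
    int count = 0;
    while (fgets(line, sizeof line, f))
        count += line_has_mount_point(line, mount_point);

    fclose(f);
    return count;
}
```

Calling count_mounts_at("/proc/self/mountinfo", "/") shortly before shutdown would show whether the three "Remounting '/'" lines correspond to three separate entries for "/".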
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
On Tue, Mar 21, 2017 at 12:04 AM, Chris Murphy wrote:
>
> c. Only XFS is left in a dirty state following the reboot. Ext4 and Btrfs
> are OK.

This is incorrect. This problem affects ext4 as well; it's just that on ext4, while the fs is left in a dirty state, the modified grub.cfg is still readable and boot is possible. But boot after a pk offline update always includes journal replay.

Basically these reboots are leaving file systems dirty. I can't tell from the available information if it's a systemd bug or a kernel bug. The file system remount to read-only is failing, and a umount isn't attempted. And I guess between systemd and the kernel, they're deciding to reboot anyway, resulting in this problem.

--
Chris Murphy
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
Thanks for the reply.

On Mon, Mar 20, 2017, 11:05 PM Mantas Mikulėnas wrote:
> First thought: Even without the exit code or anything, it's going to be
> -EBUSY like 99.999% of the time. Not much else can fail during umount.
>
> And "Filesystem is busy" would perfectly fit the earlier error message
> which you overlooked:
>
> "Process 304 (plymouthd) has been marked to be excluded from killing.
> It is running from the root file system, and thus likely to block
> re-mounting of the root file system to read-only."
>
> So you have a process holding / open (Plymouth is the boot splash screen
> app) and the kernel doesn't allow it to be umounted due to that.

a. Seems flawed to have something that can block remount to read-only. Either a flaw of Plymouth directly, or of running it from the root fs rather than the initramfs.

b. This message occurs, as well as the three remount ro messages, regardless of filesystem (volume format).

c. Only XFS is left in a dirty state following the reboot. Ext4 and Btrfs are OK.

So I'm still left with why XFS is affected, and XFS devs want to know the exit code. At reboot/shutdown time, exactly what does systemd issue to the kernel to do this?

Chris Murphy
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
First thought: Even without the exit code or anything, it's going to be -EBUSY like 99.999% of the time. Not much else can fail during umount.

And "Filesystem is busy" would perfectly fit the earlier error message which you overlooked:

"Process 304 (plymouthd) has been marked to be excluded from killing. It is running from the root file system, and thus likely to block re-mounting of the root file system to read-only."

So you have a process holding / open (Plymouth is the boot splash screen app) and the kernel doesn't allow it to be umounted due to that.

On Tue, Mar 21, 2017, 05:25 Chris Murphy wrote:
> Any thoughts on this?
>
> I've followed these instructions:
> https://freedesktop.org/wiki/Software/systemd/Debugging/
> Shutdown Completes Eventually
>
> However, no additional information is being logged that gives any
> answer to why there are three remount ro attempts, and why they aren't
> succeeding.
>
> https://github.com/systemd/systemd/blob/master/src/core/umount.c
> line 409
>
> This suggests three ro attempts shouldn't happen. And then 413 says
> that / won't actually get umounted, reboot happens leaving it ro
> mounted. So the "All filesystems unmounted." doesn't tell us anything;
> but it does seem like there should be a way to expose exit code for
> umount. I'm just not sure how to do it, and if that means compiling
> systemd myself.
>
> --
> Chris Murphy

--
Mantas Mikulėnas
Sent from my phone
Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?
Any thoughts on this?

I've followed these instructions:
https://freedesktop.org/wiki/Software/systemd/Debugging/
Shutdown Completes Eventually

However, no additional information is being logged that gives any answer to why there are three remount ro attempts, and why they aren't succeeding.

https://github.com/systemd/systemd/blob/master/src/core/umount.c
line 409

This suggests three ro attempts shouldn't happen. And then 413 says that / won't actually get umounted, reboot happens leaving it ro mounted. So the "All filesystems unmounted." doesn't tell us anything; but it does seem like there should be a way to expose the exit code for umount. I'm just not sure how to do it, and if that means compiling systemd myself.

--
Chris Murphy
[systemd-devel] more verbose debug info than systemd.log_level=debug?
I've got a Fedora 22, 23, 24, 25 bug where a systemd offline update of the kernel results in an unbootable system, on XFS only (/boot is a directory); the system boots to a grub menu. The details of that are in this bug's comment:

https://bugzilla.redhat.com/show_bug.cgi?id=1227736#c39

The gist of that is the file system is dirty following offline update, and the grub.cfg is 0 length. If the fs is mounted with a rescue system, the XFS journal is replayed and cleans things up, now there is a valid grub.cfg, and at the next reboot there is a grub menu as expected with the newly installed kernel.

That bug is on baremetal for another user, but I've reproduced it in qemu-kvm, where I use the boot parameters

systemd.log_level=debug systemd.log_target=console console=ttyS0,38400

and virsh console to capture what's going on during the offline update that results in the dirty file system. What I get is more confusing than helpful:

Sending SIGTERM to remaining processes...
Sending SIGKILL to remaining processes...
Process 304 (plymouthd) has been marked to be excluded from killing. It is running from the root file system, and thus likely to block re-mounting of the root file system to read-only. Please consider moving it into an initrd file system instead.
Unmounting file systems.
Remounting '/tmp' read-only with options 'seclabel'.
Unmounting /tmp.
Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
Remounting '/' read-only with options 'seclabel,attr2,inode64,noquota'.
All filesystems unmounted.
Deactivating swaps.
All swaps deactivated.
Detaching loop devices.
device-enumerator: scan all dirs
device-enumerator: scanning /sys/bus
device-enumerator: scanning /sys/class
All loop devices detached.
Detaching DM devices.
device-enumerator: scan all dirs
device-enumerator: scanning /sys/bus
device-enumerator: scanning /sys/class
All DM devices detached.
Spawned /usr/lib/systemd/system-shutdown/mdadm.shutdown as 8408.
/usr/lib/systemd/system-shutdown/mdadm.shutdown succeeded.
system-shutdown succeeded.
Failed to read reboot parameter file: No such file or directory
Rebooting.
[ 52.963598] Unregister pv shared memory for cpu 0
[ 52.965736] Unregister pv shared memory for cpu 1
[ 52.966795] sd 1:0:0:0: [sda] Synchronizing SCSI cache
[ 52.991220] reboot: Restarting system
[ 52.993119] reboot: machine restart

1. Why are there three remount read-only entries? Are these failing? These same three entries happen when the file system is Btrfs, so it's not an XFS specific anomaly.

2. All filesystems unmounted. What condition is required to generate this message? I guess I'm asking if it's reliable. Or is it possible that after three failed read-only remounts systemd gives up, claims the file systems are unmounted, and then reboots?

There is an XFS specific problem here, as the dirty fs problem only happens on XFS; the file system is clean if it's ext4 or Btrfs. Nevertheless it looks like something is holding up the remount, and there's no return value from umount logged.

Is there a way to get more information during shutdown than this? The question at this point is why the XFS volume is dirty at reboot time, but there's not much to go on, as I get all the same console messages for ext4 and Btrfs, which don't have a dirty fs at reboot following offline update.

Thanks,

--
Chris Murphy