[systemd-devel] Antw: [EXT] Re: Still confused with socket activation
>>> Ulrich Windl wrote on 03.02.2021 at 10:34 in message <601A6E3D.E40:161:60728>:

Lennart Poettering wrote on 02.02.2021 at 15:59 in message
<20210202145954.GB36677@gardel-login>:
> On Di, 02.02.21 10:43, Ulrich Windl (ulrich.wi...@rz.uni-regensburg.de)
> wrote:
>
>> Hi!
>>
>> Having:
>> ---
>> # /usr/lib/systemd/system/virtlockd.service
>> [Unit]
>> Description=Virtual machine lock manager
>> Requires=virtlockd.socket
>> Requires=virtlockd-admin.socket
>> Before=libvirtd.service
>> ...
>> ---
>>
>> How would I start both sockets successfully under program control?
>> If I start one socket, I cannot start the other without an error (as
>> libvirtd.service is running already; see my earlier message from last
>> week). If I mask the socket units, I cannot start libvirtd.service.
>> So would I disable the socket units and start libvirtd.service?
>> Unfortunately, if someone (an update, when vendor-preset is enabled)
>> re-enables the socket units, it would break things, so I tried to mask
>> them, but that failed, too:
>> error: Could not issue start for prm_virtlockd: Unit virtlockd.socket
>> is masked.
>
> I don't grok what you are trying to say, the excerpt of the unit file
> is too short. Please provide the relevant parts of the other unit
> files too.
So you get it:

# systemctl cat virtlockd.service
# /usr/lib/systemd/system/virtlockd.service
[Unit]
Description=Virtual machine lock manager
Requires=virtlockd.socket
Requires=virtlockd-admin.socket
Before=libvirtd.service
Documentation=man:virtlockd(8)
Documentation=https://libvirt.org

[Service]
EnvironmentFile=-/etc/sysconfig/virtlockd
ExecStart=/usr/sbin/virtlockd $VIRTLOCKD_ARGS
ExecReload=/bin/kill -USR1 $MAINPID
# Loosing the locks is a really bad thing that will
# cause the machine to be fenced (rebooted), so make
# sure we discourage OOM killer
OOMScoreAdjust=-900
# Needs to allow for max guests * average disks per guest
# libvirtd.service written to expect 4096 guests, so if we
# allow for 10 disks per guest, we get:
LimitNOFILE=40960

[Install]
Also=virtlockd.socket

# /run/systemd/system/virtlockd.service.d/50-pacemaker.conf
[Unit]
Description=Cluster Controlled virtlockd
Before=pacemaker.service pacemaker_remote.service

[Service]
Restart=no

# systemctl cat virtlockd.socket
# /usr/lib/systemd/system/virtlockd.socket
[Unit]
Description=Virtual machine lock manager socket
Before=libvirtd.service

[Socket]
ListenStream=/run/libvirt/virtlockd-sock
SocketMode=0600

[Install]
WantedBy=sockets.target

# /usr/lib/systemd/system/virtlockd-admin.socket
[Unit]
Description=Virtual machine lock manager admin socket
Before=libvirtd.service
BindsTo=virtlockd.socket
After=virtlockd.socket

[Socket]
ListenStream=/run/libvirt/virtlockd-admin-sock
Service=virtlockd.service
SocketMode=0600

[Install]
WantedBy=sockets.target

To make things worse: libvirtd also requires virtlockd:

# systemctl cat libvirtd.service
# /usr/lib/systemd/system/libvirtd.service
[Unit]
Description=Virtualization daemon
Requires=virtlogd.socket
Requires=virtlockd.socket
# Use Wants instead of Requires so that users
# can disable these three .socket units to revert
# to a traditional non-activation deployment setup
Wants=libvirtd.socket
Wants=libvirtd-ro.socket
Wants=libvirtd-admin.socket
Wants=systemd-machined.service
Before=libvirt-guests.service
After=network.target
After=dbus.service
After=iscsid.service
After=apparmor.service
After=local-fs.target
After=remote-fs.target
After=systemd-logind.service
After=systemd-machined.service
After=xencommons.service
Conflicts=xendomains.service
Documentation=man:libvirtd(8)
Documentation=https://libvirt.org

[Service]
Type=notify
EnvironmentFile=-/etc/sysconfig/libvirtd
ExecStart=/usr/sbin/libvirtd $LIBVIRTD_ARGS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
# At least 1 FD per guest, often 2 (eg qemu monitor + qemu agent).
# eg if we want to support 4096 guests, we'll typically need 8192 FDs
# If changing this, also consider virtlogd.service & virtlockd.service
# limits which are also related to number of guests
LimitNOFILE=8192
# The cgroups pids controller can limit the number of tasks started by
# the daemon, which can limit the number of domains for some hypervisors.
# A conservative default of 8 tasks per guest results in a TasksMax of
# 32k to support 4096 guests.
TasksMax=32768

[Install]
WantedBy=multi-user.target
Also=virtlockd.socket
Also=virtlogd.socket
Also=libvirtd.socket
Also=libvirtd-ro.socket

# systemctl cat libvirtd.socket
# /usr/lib/systemd/system/libvirtd.socket
[Unit]
Description=Libvirt local socket
Before=libvirtd.service

[Socket]
# The directory must
Re: [systemd-devel] udev and btrfs multiple devices
On Thu, Feb 4, 2021 at 6:28 AM Lennart Poettering wrote:
>
> On Mi, 03.02.21 22:32, Chris Murphy (li...@colorremedies.com) wrote:
> > It doesn't. It waits indefinitely.
> >
> > [* ] A start job is running for
> > /dev/disk/by-uuid/cf9c9518-45d4-43d6-8a0a-294994c383fa (12min 36s /
> > no limit)
>
> Is this on encrypted media?

No. Plain partitions.

--
Chris Murphy

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Thu, Feb 4, 2021 at 6:49 AM Lennart Poettering wrote:
> You want to optimize write patterns I understand, i.e. minimize
> iops. Hence start with profiling iops, i.e. what defrag actually
> costs, and then weigh that against the reduced access time when
> accessing the files. In particular on rotating media.

A nodatacow journal on Btrfs is no different than a journal on ext4 or
XFS. So I don't understand why you think you *also* need to defragment
the file, but only on Btrfs. You cannot do better than you already are
with a nodatacow file. That file isn't going to get any more fragmented
in use than it was at creation.

If you want to do better, maybe stop appending in 8MB increments? Every
time you append, it's another extent. Since apparently the journal
files can max out at 128MB before they are rotated, why aren't they
created at 128MB from the very start? That would have a decent chance
of getting you a file that's 1-4 extents, and it's not going to have
more extents than that.

Presumably the currently active journal not being fragmented is more
important than archived journals, because searches will happen on
recent events more than old events. Right? So if you're going to say
fragmentation matters at all, maybe stop intentionally fragmenting the
active journal? Just fallocate the max size it's going to be right off
the bat, no matter what file system it is. Once that 128MB journal is
full, leave it alone and rotate to a new 128MB file. The append is
what's making them fragmented.

But it gets worse. The way systemd-journald is submitting the journals
for defragmentation is making them more fragmented than just leaving
them alone.

https://drive.google.com/file/d/1FhffN4WZZT9gZTnG5VWongWJgPG_nlPF/view?usp=sharing

All of those archived files have more fragments (post defrag) than they
had when they were active. And here is the FIEMAP for the 96MB file,
which has 92 fragments.
https://drive.google.com/file/d/1Owsd5DykNEkwucIPbKel0qqYyS134-tB/view?usp=sharing

I don't know if it's a bug with the target size submitted by
sd-journald, or a bug in Btrfs. But it doesn't really matter: there is
no benefit to defragmenting nodatacow journals that were fallocated
upon creation.

If you want an optimization that's actually useful on Btrfs,
/var/log/journal/ could be a nested subvolume. That would prevent any
snapshots above it from turning the nodatacow journals into datacow
journals, which does significantly increase fragmentation (the same
would happen if it were a reflink copy on XFS, for that matter).

> No, but doing this once in a big linear stream when the journal is
> archived might not be so bad if then later on things are much faster
> to access for all future because the files aren't fragmented.

OK, well, in practice it's worse than doing nothing, so I'm suggesting
doing nothing.

> Somehow I think you are missing what I am asking for: some data that
> actually shows your optimization is worth it: i.e. that leaving the
> files fragmented doesn't hurt access to the journal badly, and that
> the number of iops is substantially lowered at the same time.

I don't get the iops thing at all. What we care about in this case is
latency. A least noticeable latency of around 150 ms seems reasonable
as a starting point; that's where users perceive a delay between a key
press and a character appearing. However, if I check for 10 ms latency
(using bcc-tools fileslower) when reading all of the above journals at
once:

$ sudo journalctl -D /mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager

Not a single report. None. Nothing took even 10 ms. And those journals
are more fragmented than your 20 fragments in a 100MB file.

I don't have any hard drives to test this on. This is what, 10% of the
market at this point? The best you can do there is the same as on SSD.
You can't depend on sysfs to conditionally do defragmentation on only
rotational media either; too many flash media claim to be rotating.

And by the way, I use Btrfs on an SD card in a Raspberry Pi Zero, of
all things. The cards last longer than with other file systems due to
a net lower write amplification, thanks to native compression. I
wouldn't be surprised if the cards failed sooner if I weren't using
compression. But who knows, maybe Btrfs write amplification compared to
the constant journaling of ext4 and XFS ends up being a wash. There are
a number of embedded use cases for Btrfs as well. Is compressed F2FS
better? Probably. They have a solution for the wandering trees problem,
but also no snapshots or data checksumming. But I also don't think any
of that is super relevant to the overall topic; I just offer it as a
counter-argument to the claim that Btrfs isn't appropriate for small,
cheap storage devices.

> The thing is that we tend to have few active files and many archived
> files, and since we interleave stuff our access patterns are pretty
> bad already, so we don't want to spend even more time on paying for
> extra bad access patterns because the archived files are
Re: [systemd-devel] Still confused with socket activation
On 03.02.2021 22:25, Benjamin Berg wrote:
> On Wed, 2021-02-03 at 20:47 +0300, Andrei Borzenkov wrote:
>> On 03.02.2021 00:25, Benjamin Berg wrote:
>>> On Tue, 2021-02-02 at 22:50 +0300, Andrei Borzenkov wrote:
>>>> On 02.02.2021 17:59, Lennart Poettering wrote:
>>>>>
>>>>> Note that Requires= in almost all cases should be combined with
>>>>> an order dep of After= onto the same unit.
>>>>
>>>> Years ago I asked for an example of when Requires makes sense
>>>> without After. Care to show it? I assume you must have a use case
>>>> if you say "in almost all".
>>>
>>> In the GNOME systemd units there are a few places where a Requires=
>>> is combined with Before=.
>>
>> This is functionally completely equivalent to simply using
>> Wants+Before. At least as long as you rely on *documented*
>> functions.
>
> Requires= actually has the difference that the unit must become part
> of the transaction (if it is not active already). So you get a hard
> failure and appropriate logging if the unit cannot be added to the
> transaction for some reason.

Oh, I said "documented" :) systemd documentation does not even define
what a "transaction" is. You really need to know low-level
implementation details to use it in this way. But thank you, I missed
this subtlety.

Of course, another reason could be stop behavior.

>> Care to show a more complete example and explain why Wants does not
>> work in this case?
>
> Wants= would work fine. I think it boils down to whether you find the
> extra assertions useful. The Requires= documentation actually
> suggests using Wants= exactly to avoid this.
>
> Benjamin
Re: [systemd-devel] Still confused with socket activation
On Thu, 04 Feb 2021 at 13:07:33 +0100, Reindl Harald wrote:
> "Requires=a.service" combined with "Before=a.service" is contradictory -
> don't you get that?

It means what it says: whenever my service is enabled, a.service must
also be enabled, but my service has to start first (and stop last).

For instance, imagine this scenario:

* my service sets something up that will trigger an action later:
  perhaps it creates a configuration fragment in a.conf.d
* any number of other services might be doing the same as my service
* whenever at least one service has done that, when they have all
  finished doing their setup, we need to start a.service, which will
  take those triggers (e.g. contents of a.conf.d) and "commit" whatever
  is needed for all of them
* if multiple services have set up a triggered action, we only want to
  run a.service once, and it will act on all the triggers as a batch

Then my service should have Requires=a.service (because if a.service
never runs, the "commit" action will never happen and my service will
not do what I wanted); and it should have Before=a.service (because the
triggers need to be in place before a.service processes them).

dpkg and RPM triggers are not (currently!) systemd services, but they
work a lot like this. They're typically used to regenerate caches.

Another reason to combine Requires with Before is that Before is really
a short name for "start before and stop after", and After is really a
short name for "start after and stop before". If you're dealing with
actions taken during system shutdown or session logout, the stop action
might be the one that does the real work, making it more likely that
the Before dependencies are the important ones.

    smcv
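The "commit" scenario above can be written down as a unit. All names here (setup-foo.service, a.service, /etc/a.conf.d) are hypothetical, matching the hypothetical example in the message rather than any real software:

```ini
# setup-foo.service (hypothetical): one of the services that drops a
# trigger fragment, then relies on a.service to commit it.
[Unit]
Description=Install one fragment for a.service to commit
Requires=a.service
Before=a.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/sh -c 'echo option=1 > /etc/a.conf.d/foo.conf'
```

However many such setup-*.service units are active, a.service is started once after all of them, and processes all the fragments as a batch.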
Re: [systemd-devel] Still confused with socket activation
On Thu, 2021-02-04 at 13:07 +0100, Reindl Harald wrote:
> Am 04.02.21 um 12:46 schrieb Benjamin Berg:
>> On Wed, 2021-02-03 at 16:43 +0100, Reindl Harald wrote:
>>> seriously - explain what you expect to happen in case of
>>>
>>> Requires=a.service
>>> Before=a.service
>>>
>>> except some warning that it's nonsense
>>
>> So, one way I used it is as an ExecStartPost= equivalent for a
>> .target unit, i.e. pull in a Type=oneshot service once a target has
>> become active in order to execute a simple command
>
> "Requires=a.service" combined with "Before=a.service" is
> contradictory - don't you get that?

Your statements will not become more informed by repeating them.

It looks to me like you are interpreting Requires= incorrectly. Of
course, one can see a contradiction in saying "B requires A in order to
run" and then also saying "start A after B is ready". But systemd
considers requirements and ordering as two independent problems. As
such, "Requires=A" only means something like "unit A must be added to
the transaction together with B", a statement that does not imply
ordering.

Yes, this is a very logical/mathematical meaning, which may not be what
you intuitively expect. And it does have the unfortunate side effect of
sometimes confusing people, who then forget to add a needed After= that
they thought was implied. But it is well defined what happens when
combining Requires= with Before=. There is no contradiction.

Benjamin
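The ExecStartPost=-equivalent pattern Benjamin describes can be sketched as a pair of units (all names here are hypothetical, invented for illustration):

```ini
# my.target (hypothetical): pulls post-setup.service into the same
# transaction via Requires=, and orders itself *before* it, so the
# oneshot runs only once the target itself is up.
[Unit]
Description=Example target
Requires=post-setup.service
Before=post-setup.service

# post-setup.service (hypothetical)
[Unit]
Description=Command to run after my.target becomes active

[Service]
Type=oneshot
ExecStart=/usr/bin/logger "my.target is up"
```

Here Requires= plus Before= on the target is deliberate, not contradictory: the requirement says "this service must be part of the transaction", and the ordering says "but start it after me".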
Re: [systemd-devel] consider dropping defrag of journals on btrfs
Lennart Poettering writes:
> Well, at least on my system here there are still like 20 fragments per
> file. That's not nothin?

In a 100 MB file? It could be better, but I very much doubt you're
going to notice a difference after defragmenting that.

I may be the nut that rescued the old ext2 defrag utility from the
dustbin of history, but even I have to admit that it isn't really
important to use, and there is a reason why the Linux community
abandoned it.
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Mi, 03.02.21 23:11, Chris Murphy (li...@colorremedies.com) wrote:

> On Wed, Feb 3, 2021 at 9:46 AM Lennart Poettering wrote:
>>
>> Performance is terrible if cow is used on journal files while we
>> write them.
>
> I've done it for a year on NVMe. The latency is so low, it doesn't
> matter. Maybe do it on rotating media...
>
>> It would be great if we could turn datacow back on once the files
>> are archived, and then take benefit of compression/checksumming and
>> stuff. Not sure if there's any sane API for that in btrfs besides
>> rewriting the whole file, though. Anyone knows?
>
> A compressed file results in a completely different encoding and
> extent size, so it's a complete rewrite of the whole file, regardless
> of the cow/nocow status.
>
> Without compression it'd be a rewrite too, because in effect it's a
> different extent type that comes with checksums. I.e. a reflink copy
> of a nodatacow file can only be a nodatacow file; a reflink copy of a
> datacow file can only be a datacow file. The conversion between them
> is basically 'cp --reflink=never', and you get a complete rewrite.
>
> But you get a complete rewrite of extents by submitting for
> defragmentation too, depending on the target extent size.
>
> It is possible to do what you want by no longer setting nodatacow on
> the enclosing dir. Create a 0-length journal file, set nodatacow on
> that file, then fallocate it. That gets you a nodatacow active
> journal. And then you can just duplicate it in place with a new name,
> and the result will be datacow and automatically compressed if
> compression is enabled.
>
> But the write hit has already happened by writing journal data into
> this journal file during its lifetime. Just rename it on rotate.
> That's the least IO impact possible at this point. Defragmenting it
> means even more writes, and not much of a gain if any, unless it's
> datacow, which isn't the journald default.
You are focussing only on the one-time iops generated during archival,
and are ignoring the extra latency during access that fragmented files
cost.

Show me that the iops reduction during the one-time operation matters
and that the extra latency during access doesn't matter, and we can
look into making changes. But without anything resembling any form of
profiling we are just blind people in the fog...

Lennart

--
Lennart Poettering, Berlin
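Chris's "0-length file, set nodatacow, fallocate, then duplicate on archive" workflow can be sketched as below. This is a rough illustration with made-up paths, not journald code: `chattr +C` only has an effect on Btrfs, and only reliably on an empty file, so failures are tolerated here for portability of the sketch.

```python
import os
import subprocess
import tempfile

d = tempfile.mkdtemp()
active = os.path.join(d, "system.journal")

# 1. Create a zero-length file and mark it NOCOW before it holds data;
#    the flag cannot be meaningfully applied once the file has content.
open(active, "wb").close()
try:
    subprocess.run(["chattr", "+C", active], check=False)
except OSError:
    pass  # chattr missing or unsupported file system: fine for a sketch

# 2. Preallocate the full target size in one call, so the active
#    journal starts with as few extents as the allocator can manage.
SIZE = 8 * 1024 * 1024  # 8 MiB here; journald's actual max is 128 MB
fd = os.open(active, os.O_WRONLY)
os.posix_fallocate(fd, 0, SIZE)
os.close(fd)

# 3. On archival, a plain (non-reflink) copy rewrites every block; on
#    Btrfs the new copy is datacow, checksummed, and compressed if
#    compression is enabled on the file system.
archived = os.path.join(d, "system@0001.journal")
subprocess.run(["cp", "--reflink=never", active, archived], check=True)

print(os.path.getsize(active), os.path.getsize(archived))
```

The trade-off both sides are debating is exactly step 3: the copy is a full extra write of the file, bought in exchange for a datacow, compressible archive.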
Re: [systemd-devel] consider dropping defrag of journals on btrfs
On Mi, 03.02.21 22:51, Chris Murphy (li...@colorremedies.com) wrote:

> > > Since systemd-journald sets nodatacow on /var/log/journal, the
> > > journals don't really fragment much. I typically see 2-4 extents
> > > for the life of the journal, depending on how many times it's
> > > grown, in what looks like 8MiB increments. The defragment isn't
> > > really going to make any improvement on that, at least not worth
> > > submitting it for additional writes on SSD. While laptop and
> > > desktop SSD/NVMe can handle such a small amount of extra writes
> > > with no meaningful impact on wear, it probably does have an
> > > impact on much more low-end flash like USB sticks, eMMC, and SD
> > > cards. So I figure, let's just drop the defragmentation step
> > > entirely.
> >
> > Quite frankly, given how iops-expensive btrfs is, one probably
> > shouldn't choose btrfs for such small devices anyway. It's really
> > not where btrfs shines, last time I looked.
>
> Btrfs aggressively delays metadata and data allocation, so I don't
> agree that it's expensive.

It's not a matter of agreeing or not. Last time people showed me
benchmarks (which admittedly was 2 or 3 years ago), the number of iops
for typical workloads was typically twice as much as on ext4. Which I
don't really want to criticize; it's just the way that it is. I mean,
maybe they managed to lower the iops since then, but it's not a matter
of "agreeing", it's a matter of showing benchmarks that indicate this
is not a problem anymore.

> But in any case, reading a journal file and rewriting it out, which
> is what defragment does, doesn't really have any benefit given the
> file doesn't fragment much anyway due to (a) nodatacow and (b)
> fallocate, which is what systemd-journald does on Btrfs.

Well, at least on my system here there are still like 20 fragments per
file. That's not nothin?

> > Did you actually check the iops this generates?
>
> I don't understand the relevance.

You want to optimize write patterns, I understand, i.e. minimize iops.
Hence start with profiling iops, i.e. what defrag actually costs, and
then weigh that against the reduced access time when accessing the
files. In particular on rotating media.

> > Not sure it's worth doing these kinds of optimizations without any
> > hard data on how expensive this really is. It would be premature.
>
> Submitting the journal for defragment in effect duplicates the
> journal: read all extents, and rewrite those blocks to a new
> location. It's doubling the writes for that journal file. It's not
> like the defragment is free.

No, but doing this once in a big linear stream when the journal is
archived might not be so bad if then, later on, things are much faster
to access for all future accesses because the files aren't fragmented.

> Somehow I think you're missing what I've been asking for, which is to
> stop the unnecessary defragment step because it's not an
> optimization. It doesn't meaningfully reduce fragmentation at all, it
> just adds write amplification.

Somehow I think you are missing what I am asking for: some data that
actually shows your optimization is worth it, i.e. that leaving the
files fragmented doesn't hurt access to the journal badly, and that
the number of iops is substantially lowered at the same time.

The thing is that we tend to have few active files and many archived
files, and since we interleave stuff our access patterns are pretty
bad already, so we don't want to spend even more time paying for extra
bad access patterns because the archived files are fragmented.

Lennart

--
Lennart Poettering, Berlin
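The kind of measurement being asked for here can be approximated without special tooling by sampling /proc/diskstats around the operation in question. This is a sketch of the idea, not how journald or anyone actually profiles this; the field layout follows the kernel's iostats documentation.

```python
import os

# Sample per-device "writes completed" counters from /proc/diskstats
# before and after an operation to estimate its cost in write ops.

def read_write_ios():
    """Return {device_name: writes_completed} from /proc/diskstats."""
    stats = {}
    if not os.path.exists("/proc/diskstats"):
        return stats  # e.g. restricted container; nothing to report
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            # fields: major minor name reads ...; index 7 is "writes
            # completed" per Documentation/admin-guide/iostats.rst
            stats[fields[2]] = int(fields[7])
    return stats

def write_iops_cost(operation):
    """Run operation() and return per-device write-completion deltas."""
    before = read_write_ios()
    operation()
    after = read_write_ios()
    return {dev: after.get(dev, 0) - n for dev, n in before.items()}

# Example usage with a placeholder no-op operation; in a real profile
# the callable would trigger the defrag (or archive) being measured.
cost = write_iops_cost(lambda: None)
```

Note these are completion counts, not queue-depth-aware iops, and writeback means the deltas only settle after a sync; it is a first-order estimate, no more.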
Re: [systemd-devel] udev and btrfs multiple devices
On Mi, 03.02.21 22:32, Chris Murphy (li...@colorremedies.com) wrote:

> On Thu, Jan 28, 2021 at 7:18 AM Lennart Poettering wrote:
>>
>> On Mi, 27.01.21 17:19, Chris Murphy (li...@colorremedies.com) wrote:
>>
>>> Is it possible for a udev rule to have a timeout? For example:
>>> /usr/lib/udev/rules.d/64-btrfs.rules
>>>
>>> This udev rule will wait indefinitely for a missing device to
>>> appear.
>>
>> Hmm, no, that's a misunderstanding. Rules can't "wait". The
>> activation of the btrfs file system won't happen, but that should
>> then be caught by systemd mount timeouts and put you into recovery
>> mode.
>
> It doesn't. It waits indefinitely.
>
> [* ] A start job is running for
> /dev/disk/by-uuid/cf9c9518-45d4-43d6-8a0a-294994c383fa (12min 36s /
> no limit)

Is this on encrypted media?

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] Limitation on maximum number of systemd timers that can be active
Thank you very much, Lennart, for the help. I was eager to know whether
there was any known limitation, hence this question.

Hi Andy, I am currently building a diagnostics data collector that
collects various diagnostics data at different scheduled intervals, as
configured by the user. systemd-timer is used for running the
schedules. I need to enforce a limit on the maximum number of schedules
the user can use for this feature. Currently I am deciding that limit,
hence I am interested in the maximum value up to which we can let the
user configure timers without creating much/noticeable performance
impact. I will do performance testing on a Raspberry Pi 3 and share my
observations.

Thank you all for your support.

On Wed, Feb 3, 2021 at 9:35 PM Lennart Poettering wrote:
> On Mi, 03.02.21 12:16, P.R.Dinesh (pr.din...@gmail.com) wrote:
>
>> Do we have any limitation on the maximum number of systemd timers /
>> units that can be active in the system?
>
> We currently enforce a limit of 128K units. This is controlled by the
> MANAGER_MAX_NAMES define, which is hard compiled in.
>
>> Will it consume high cpu/memory if we configure 1000s of systemd
>> timers?
>
> It will consume a bit of memory, but I'd guess it should scale OK.
>
> All scalability issues regarding the number of units we saw many
> years ago; by now all slow paths I am aware of have been fixed. I
> mean, we can certainly still optimize stuff (i.e. "systemctl
> daemon-reload" is expensive), but to my knowledge having a few K of
> units should be totally OK. (But then again I don't run things like
> that myself; my knowledge is purely based on feedback, or the recent
> lack thereof.)
>
> Lennart
>
> --
> Lennart Poettering, Berlin

--
With Kind Regards,
P R Dinesh
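For reference, one schedule in such a collector could be a timer/service pair along these lines (the unit names, the hourly schedule, and the collector path are made up for illustration):

```ini
# diag-collect.timer (hypothetical)
[Unit]
Description=Run the diagnostics collector on a schedule

[Timer]
OnCalendar=hourly
AccuracySec=1min
Persistent=true

[Install]
WantedBy=timers.target

# diag-collect.service (hypothetical)
[Unit]
Description=Collect one diagnostics bundle

[Service]
Type=oneshot
ExecStart=/usr/local/bin/collect-diagnostics
```

Each user-configured schedule would get its own such pair, so a product limit in the hundreds or low thousands of timers stays far below the 128K-unit ceiling mentioned above.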
Re: [systemd-devel] Still confused with socket activation
On 04.02.21 at 12:46, Benjamin Berg wrote:
> On Wed, 2021-02-03 at 16:43 +0100, Reindl Harald wrote:
>> seriously - explain what you expect to happen in case of
>>
>> Requires=a.service
>> Before=a.service
>>
>> except some warning that it's nonsense
>
> So, one way I used it is as an ExecStartPost= equivalent for a
> .target unit, i.e. pull in a Type=oneshot service once a target has
> become active in order to execute a simple command

"Requires=a.service" combined with "Before=a.service" is contradictory -
don't you get that?
Re: [systemd-devel] Still confused with socket activation
On Wed, 2021-02-03 at 16:43 +0100, Reindl Harald wrote:
> seriously - explain what you expect to happen in case of
>
> Requires=a.service
> Before=a.service
>
> except some warning that it's nonsense

So, one way I used it is as an ExecStartPost= equivalent for a .target
unit, i.e. pull in a Type=oneshot service once a target has become
active in order to execute a simple command.

Benjamin