On Wed, 20.07.16 12:53, Daniel P. Berrange (berra...@redhat.com) wrote:

> For virtualized hosts it is quite common to want to confine all host OS
> processes to a subset of CPUs/RAM nodes, leaving the rest available for
> exclusive use by QEMU/KVM.  Historically people have used the "isolcpus"
> kernel arg to do this, but last year that had its semantics changed, so
> that any CPUs listed there also get excluded from load balancing by the
> scheduler, making it quite useless in general non-real-time use cases
> where you still want QEMU threads load-balanced across CPUs.
> 
> So the only option is to use the cpuset cgroup controller to confine
> processes. AFAIK, systemd does not have explicit support for the cpuset
> controller at this time, so I'm trying to work out the "optimal" way to
> achieve this behind systemd's back while minimising the risk that future
> systemd releases will break things.

Yes, we don't support this as of now, but we'd like to. The thing
though is that the kernel interface for it is pretty borked as it is
right now, and until that's fixed we are unlikely to support this in
systemd. (And as I understood Tejun, the mem vs. cpu thing in cpuset
is probably not going to stay the way it is either.)

But note that the non-cgroup CPUAffinity= setting should be good
enough for many use cases. Are you sure that isn't sufficient for you?
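
For example, a drop-in pinning just the QEMU-running service to the CPUs
you want to hand out (purely illustrative, using a made-up unit name):

# /etc/systemd/system/my-qemu.service.d/cpus.conf
[Service]
CPUAffinity=3 4 5 6 7 8 9 10 11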

Also note that systemd supports setting a system-wide CPUAffinity= for
itself during early boot, thus leaving all unlisted CPUs free for
specific services where you use CPUAffinity= to change this default.
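
That would look roughly like this (CPU numbers taken from your example
below; I spelled out the indices since I'm not sure off-hand whether the
range syntax is available on the versions you mention):

# /etc/systemd/system.conf
[Manager]
CPUAffinity=0 1 2

Everything forked off PID 1 inherits that mask by default, and the
services that shall use the remaining CPUs override it with their own
CPUAffinity= as above.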

> As an example I have a host with 3 NUMA nodes and 12 CPUs, and want to have
> all non-QEMU processes running on CPUs 0-2, leaving 3-11 available
> for QEMU machines.
> 
> So far my best solution looks like this:
> 
> $ cat /etc/systemd/system/cpuset.service
> [Unit]
> Description=Restrict CPU placement
> DefaultDependencies=no
> Before=sysinit.target slices.target basic.target lvm2-lvmetad.service systemd-journald.service systemd-udevd.service
> 
> [Service]
> Type=oneshot
> KillMode=none
> RemainAfterExit=yes
> ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuset/system.slice
> ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuset/machine.slice
> ExecStartPre=/bin/bash -c '/usr/bin/echo "0-2" > /sys/fs/cgroup/cpuset/system.slice/cpuset.cpus'
> ExecStartPre=/bin/bash -c '/usr/bin/echo "0" > /sys/fs/cgroup/cpuset/system.slice/cpuset.mems'
> ExecStartPre=/bin/bash -c '/usr/bin/echo "3-11" > /sys/fs/cgroup/cpuset/machine.slice/cpuset.cpus'
> ExecStartPre=/bin/bash -c '/usr/bin/echo "0-2" > /sys/fs/cgroup/cpuset/machine.slice/cpuset.mems'
> ExecStartPost=/bin/bash -c '/usr/bin/echo 1 > /sys/fs/cgroup/cpuset/system.slice/tasks'

Current systemd versions never place processes directly in slices, and
hence PID 1 sits in "init.scope".

> The key factor here is the use of "Before=" to ensure this gets run immediately
> after systemd switches root out of the initrd, and before /any/ long-lived
> services are run. This lets us set cpuset placement on systemd (pid 1)
> itself and have that inherited by everything it spawns. I felt this is
> better than trying to move processes after they have already started,
> because it ensures that any memory allocations get taken from the right
> NUMA node immediately.
>
> Empirically this approach seems to work on Fedora 23 (systemd 222) and
> RHEL 7 (systemd 219), but I'm wondering if there's any pitfalls that I've
> not anticipated here.

Yes, PID 1 was moved to the special scope unit init.scope as mentioned
above (in preparation for cgroup v2, where inner cgroups can never
contain processes). Your approach is hence likely to break on such
versions.
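
You can verify where PID 1 actually ended up with something like:

$ grep -e name=systemd -e cpuset /proc/1/cgroup

On current versions the name=systemd entry points at /init.scope rather
than the root or one of the slices.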

But again, I have the suspicion that CPUAffinity= might already
suffice for you?

Lennart

-- 
Lennart Poettering, Red Hat