Re: [systemd-devel] Confining ALL processes to a CPUs/RAM via cpuset controller

2016-07-20 Thread Lennart Poettering
On Wed, 20.07.16 14:49, Daniel P. Berrange (berra...@redhat.com) wrote:

> > > The key factor here is use of "Before" to ensure this gets run immediately
> > > after systemd switches root out of the initrd, and before /any/ long lived
> > > services are run. This lets us set cpuset placement on systemd (pid 1)
> > > itself and have that inherited by everything it spawns. I felt this is
> > > better than trying to move processes after they have already started,
> > > because it ensures that any memory allocations get taken from the right
> > > NUMA node immediately.
> > >
> > > Empirically this approach seems to work on Fedora 23 (systemd 222) and
> > > RHEL 7 (systemd 219), but I'm wondering if there's any pitfalls that I've
> > > not anticipated here.
> > 
> > Yes, PID 1 was moved to the special scope unit init.scope as mentioned
> > above (in preparation for cgroupsv2 where inner cgroups can never
> > contain PIDs). This is likely going to break then.
> 
> cgroupsv2 is likely to break many things once distros switch over, so
> I assume that wouldn't be done in a minor update - only in a major new
> distro release - so that's not so concerning.

To keep things obvious we also moved PID 1 into init.scope on
cgroupsv1 systems.

Hence, your script might already break as soon as you update to a more
recent systemd version, regardless of whether cgroupsv1 or cgroupsv2 is
used.

> > But again, I have the suspicion that CPUAffinity= might already
> > suffice for you?
> 
> Yep, it looks like it should suffice for most people, unless they also
> wish to have memory node restrictions enforced from boot.

I'd be open to adding some sane subset of numactl as friendly
high-level options to systemd too. As my knowledge of NUMA (and access
to systems with NUMA) is pretty limited, we'd need a contributor patch
for that, however.
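
By "subset of numactl" I mean roughly its placement bits, i.e. being
able to express per-service something along the lines of (purely
illustrative, the binary name and node numbers are made up):

  # restrict both CPU placement and memory allocations to NUMA node 0
  numactl --cpunodebind=0 --membind=0 /usr/bin/some-daemon

but as friendly unit file options, so that people don't have to script
against the cpuset hierarchy themselves.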

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] Confining ALL processes to a CPUs/RAM via cpuset controller

2016-07-20 Thread Daniel P. Berrange
On Wed, Jul 20, 2016 at 03:29:30PM +0200, Lennart Poettering wrote:
> On Wed, 20.07.16 12:53, Daniel P. Berrange (berra...@redhat.com) wrote:
> 
> > For virtualized hosts it is quite common to want to confine all host OS
> > processes to a subset of CPUs/RAM nodes, leaving the rest available for
> > exclusive use by QEMU/KVM.  Historically people have used the "isolcpus"
> kernel arg to do this, but last year that had its semantics changed, so
> that any CPUs listed there also get excluded from load balancing by the
> scheduler, making it quite useless in general non-real-time use cases
> > where you still want QEMU threads load-balanced across CPUs.
> > 
> > So the only option is to use the cpuset cgroup controller to confine
> processes. AFAIK, systemd does not have explicit support for the cpuset
> > controller at this time, so I'm trying to work out the "optimal" way to
> > achieve this behind systemd's back while minimising the risk that future
> > systemd releases will break things.
> 
> Yes, we don't support this as of now, but we'd like to. The thing
> though is that the kernel interface for it is pretty borked as it is
> right now, and until that's fixed we are unlikely to support
> this in systemd. (And as I understood Tejun the mem vs. cpu thing in
> cpuset is probably not going to stay the way it is either)
> 
> But note that the non-cgroup CPUAffinity= setting should be good
> enough for many use cases. Are you sure that isn't sufficient for you?
> 
> Also note that systemd supports setting a system-wide CPUAffinity= for
> itself during early boot, thus leaving all unlisted CPUs free for
> specific services where you use CPUAffinity= to change this default.

Ah, interesting, I didn't notice you could set that globally.
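
Presumably that's CPUAffinity= in the [Manager] section of
/etc/systemd/system.conf, i.e. something roughly like this for my
example host (values illustrative):

  [Manager]
  CPUAffinity=0 1 2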


> > The key factor here is use of "Before" to ensure this gets run immediately
> > after systemd switches root out of the initrd, and before /any/ long lived
> > services are run. This lets us set cpuset placement on systemd (pid 1)
> > itself and have that inherited by everything it spawns. I felt this is
> > better than trying to move processes after they have already started,
> > because it ensures that any memory allocations get taken from the right
> > NUMA node immediately.
> >
> > Empirically this approach seems to work on Fedora 23 (systemd 222) and
> > RHEL 7 (systemd 219), but I'm wondering if there's any pitfalls that I've
> > not anticipated here.
> 
> Yes, PID 1 was moved to the special scope unit init.scope as mentioned
> above (in preparation for cgroupsv2 where inner cgroups can never
> contain PIDs). This is likely going to break then.

cgroupsv2 is likely to break many things once distros switch over, so
I assume that wouldn't be done in a minor update - only in a major new
distro release - so that's not so concerning.

> But again, I have the suspicion that CPUAffinity= might already
> suffice for you?

Yep, it looks like it should suffice for most people, unless they also
wish to have memory node restrictions enforced from boot.
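
(One way to see the difference: CPUAffinity= only narrows the
scheduler mask, it doesn't touch the allowed memory nodes, which is
easy to check against PID 1, e.g.:

  $ grep -E 'Cpus_allowed_list|Mems_allowed_list' /proc/1/status

With only CPUAffinity= set, Mems_allowed_list still spans all nodes;
it's the cpuset.mems approach that actually restricts where
allocations come from.)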

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: [systemd-devel] Confining ALL processes to a CPUs/RAM via cpuset controller

2016-07-20 Thread Lennart Poettering
On Wed, 20.07.16 12:53, Daniel P. Berrange (berra...@redhat.com) wrote:

> For virtualized hosts it is quite common to want to confine all host OS
> processes to a subset of CPUs/RAM nodes, leaving the rest available for
> exclusive use by QEMU/KVM.  Historically people have used the "isolcpus"
> kernel arg to do this, but last year that had its semantics changed, so
> that any CPUs listed there also get excluded from load balancing by the
> scheduler, making it quite useless in general non-real-time use cases
> where you still want QEMU threads load-balanced across CPUs.
> 
> So the only option is to use the cpuset cgroup controller to confine
> processes. AFAIK, systemd does not have explicit support for the cpuset
> controller at this time, so I'm trying to work out the "optimal" way to
> achieve this behind systemd's back while minimising the risk that future
> systemd releases will break things.

Yes, we don't support this as of now, but we'd like to. The thing
though is that the kernel interface for it is pretty borked as it is
right now, and until that's fixed we are unlikely to support
this in systemd. (And as I understood Tejun the mem vs. cpu thing in
cpuset is probably not going to stay the way it is either)

But note that the non-cgroup CPUAffinity= setting should be good
enough for many use cases. Are you sure that isn't sufficient for you?

Also note that systemd supports setting a system-wide CPUAffinity= for
itself during early boot, thus leaving all unlisted CPUs free for
specific services where you use CPUAffinity= to change this default.
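
I.e. roughly this kind of setup (values purely illustrative):

  # /etc/systemd/system.conf
  [Manager]
  CPUAffinity=0 1 2

  # drop-in for a service that may use the remaining CPUs, e.g.
  # /etc/systemd/system/foo.service.d/cpuaffinity.conf
  [Service]
  CPUAffinity=3 4 5 6 7 8 9 10 11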

> As an example I have a host with 3 NUMA nodes and 12 CPUs, and want to have
> all non-QEMU processes running on CPUs 0 & 1, leaving 3-11 available
> for QEMU machines
> 
> So far my best solution looks like this:
> 
> $ cat /etc/systemd/system/cpuset.service
> [Unit]
> Description=Restrict CPU placement
> DefaultDependencies=no
> Before=sysinit.target slices.target basic.target lvm2-lvmetad.service systemd-journald.service systemd-udevd.service
> 
> [Service]
> Type=oneshot
> KillMode=none
> RemainAfterExit=yes
> ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuset/system.slice
> ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuset/machine.slice
> ExecStartPre=/bin/bash -c '/usr/bin/echo "0-2" > /sys/fs/cgroup/cpuset/system.slice/cpuset.cpus'
> ExecStartPre=/bin/bash -c '/usr/bin/echo "0" > /sys/fs/cgroup/cpuset/system.slice/cpuset.mems'
> ExecStartPre=/bin/bash -c '/usr/bin/echo "3-11" > /sys/fs/cgroup/cpuset/machine.slice/cpuset.cpus'
> ExecStartPre=/bin/bash -c '/usr/bin/echo "0-2" > /sys/fs/cgroup/cpuset/machine.slice/cpuset.mems'
> ExecStartPost=/bin/bash -c '/usr/bin/echo 1 > /sys/fs/cgroup/cpuset/system.slice/tasks'

Current systemd versions never place processes in slices, and PID 1
hence sits in "init.scope".

> The key factor here is use of "Before" to ensure this gets run immediately
> after systemd switches root out of the initrd, and before /any/ long lived
> services are run. This lets us set cpuset placement on systemd (pid 1)
> itself and have that inherited by everything it spawns. I felt this is
> better than trying to move processes after they have already started,
> because it ensures that any memory allocations get taken from the right
> NUMA node immediately.
>
> Empirically this approach seems to work on Fedora 23 (systemd 222) and
> RHEL 7 (systemd 219), but I'm wondering if there's any pitfalls that I've
> not anticipated here.

Yes, PID 1 was moved to the special scope unit init.scope as mentioned
above (in preparation for cgroupsv2 where inner cgroups can never
contain PIDs). This is likely going to break then.

But again, I have the suspicion that CPUAffinity= might already
suffice for you?

Lennart

-- 
Lennart Poettering, Red Hat