On Fri, Sep 16, 2016 at 9:19 AM, Peter Zijlstra <pet...@infradead.org> wrote:
> On Fri, Sep 16, 2016 at 08:12:58AM -0700, Andy Lutomirski wrote:
>> On Sep 16, 2016 12:51 AM, "Peter Zijlstra" <pet...@infradead.org> wrote:
>> > On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote:
>> > > BTW, Mike keeps mentioning exclusive cgroups as problematic with the
>> > > no-internal-tasks constraints. Do exclusive cgroups still exist in
>> > > cgroup2? Could we perhaps just remove that capability entirely? I've
>> > > never understood what problem exclusive cpusets and such solve that
>> > > can't be more comprehensibly solved by just assigning the cpusets the
>> > > normal inclusive way.
>> > Without exclusive sets we cannot split the sched_domain structure.
>> > Which leads to not being able to actually partition things. That would
>> > break DL for one.
>> Can you sketch out a toy example?
> [ Also see Documentation/cgroup-v1/cpusets.txt section 1.7 ]
> mkdir /cpuset
> mount -t cgroup -o cpuset none /cpuset
> mkdir /cpuset/A
> mkdir /cpuset/B
> cat /sys/devices/system/node/node0/cpulist > /cpuset/A/cpuset.cpus
> echo 0 > /cpuset/A/cpuset.mems
> cat /sys/devices/system/node/node1/cpulist > /cpuset/B/cpuset.cpus
> echo 1 > /cpuset/B/cpuset.mems
> # move all movable tasks into A
> cat /cpuset/tasks | while read task; do echo $task > /cpuset/A/tasks ; done
> # kill machine wide load-balancing
> echo 0 > /cpuset/cpuset.sched_load_balance
> # now place 'special' tasks in B
> This partitions the scheduler into two, one for each node.
> Hereafter no task will be moved from one node to another. The
> load-balancer is split in two: one balances in A, one balances in B;
> nothing crosses. (It is important that A.cpus and B.cpus do not
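
[Editor's sketch, not part of the original message.] The recipe quoted above depends on the CPU lists written to A and B sharing no CPU. A minimal way to check that before writing them, assuming plain cpulist strings such as "0-3,8" (the stride form "0-7:2" is not handled); the helper names expand_cpulist and cpulists_disjoint are hypothetical:

```shell
#!/bin/sh
# Expand a kernel cpulist string such as "0-3,8" into one CPU number per line.
# Assumption: only "N" and "N-M" items appear; strides ("0-7:2") are not handled.
expand_cpulist() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS=- read -r first last; do
        [ -n "$first" ] || continue
        # A single CPU ("8") has no range end; treat it as "8-8".
        [ -n "$last" ] || last=$first
        cpu=$first
        while [ "$cpu" -le "$last" ]; do
            echo "$cpu"
            cpu=$((cpu + 1))
        done
    done
}

# Exit 0 only if the two cpulists have no CPU in common.
cpulists_disjoint() {
    a=$(expand_cpulist "$1")
    for cpu in $(expand_cpulist "$2"); do
        # -x matches whole lines, so "1" does not match "11".
        if printf '%s\n' "$a" | grep -qx "$cpu"; then
            echo "CPU $cpu is in both lists" >&2
            return 1
        fi
    done
    return 0
}
```

With the layout from the quoted commands, this could be called as cpulists_disjoint "$(cat /cpuset/A/cpuset.cpus)" "$(cat /cpuset/B/cpuset.cpus)".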
> Ideally no task would remain in the root group; back in the day we could
> actually do this (with exception of the cpu bound kernel threads), but
> this has significantly regressed :-(
> (still hate the workqueue affinity interface)
I wonder if we could address this by creating (automatically at boot
or when the cpuset controller is enabled or whatever) a
/cpuset/random_kernel_shit cgroup and have all of the unmoveable tasks
> As is, tasks that are left in the root group get balanced within
> whatever domain they ended up in.
>> And what's DL?
> SCHED_DEADLINE, it's a 'Global'-EDF like scheduler that doesn't support
> CPU affinities (because that doesn't make sense). The only way to
> restrict it is to partition.
> 'Global' because you can partition it. If you reduce your system to
> single CPU partitions you'll reduce to P-EDF.
> (The same is true of SCHED_FIFO, that's a 'Global'-FIFO on the same
> partition scheme, it however does support sched_affinity, but using it
> gives 'interesting' schedulability results -- call it a historic
Hmm, I didn't realize that the deadline scheduler was global. But
ISTM requiring the use of "exclusive" to get this working is
unfortunate. What if a user wants two separate partitions, one using
CPUs 1 and 2 and the other using CPUs 3 and 4 (with 5 reserved for
non-RT stuff)? Shouldn't we be able to have a cgroup for each of the
DL partitions and do something to tell the deadline scheduler "here is
> On a related but different note, we have the isolcpus boot parameter
> which creates single CPU partitions for all listed CPUs and gives the
> rest to the root cpuset. Ideally we'd kill this option given it's a boot
> time setting (for something which is trivial to do at runtime).
> But this cannot be done, because that would mean we'd have to start with
> a !0 cpuset layout:
>                /  \
>        'system'    'isolated'
>  cpus=~isolcpus    cpus=isolcpus
> And start with _everything_ in the /system group (including default IRQ
> Of course, that will break everything cgroup :-(
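
[Editor's sketch, not part of the original message.] The 'system' / 'isolated' layout quoted above needs two CPU lists: the isolated CPUs and their complement (cpus=~isolcpus). A rough way to compute the complement at runtime, assuming plain "N" and "N-M" cpulist items only; the helper names expand_cpulist and cpulist_complement are hypothetical:

```shell
#!/bin/sh
# Expand a kernel cpulist string such as "0-3,8" into one CPU number per line.
# Assumption: only "N" and "N-M" items appear; strides ("0-7:2") are not handled.
expand_cpulist() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS=- read -r first last; do
        [ -n "$first" ] || continue
        [ -n "$last" ] || last=$first
        cpu=$first
        while [ "$cpu" -le "$last" ]; do
            echo "$cpu"
            cpu=$((cpu + 1))
        done
    done
}

# Print the CPUs of $1 (e.g. the online list) that are not in $2
# (the isolated list), as a comma-separated cpulist.
cpulist_complement() {
    isolated=$(expand_cpulist "$2")
    out=
    for cpu in $(expand_cpulist "$1"); do
        if ! printf '%s\n' "$isolated" | grep -qx "$cpu"; then
            # Append with a comma separator except for the first entry.
            out="${out:+$out,}$cpu"
        fi
    done
    echo "$out"
}
```

For example, cpulist_complement "$(cat /sys/devices/system/cpu/online)" "2-3" would give the list for the 'system' group, and "2-3" itself would go to 'isolated'. Writing those lists into cpuset.cpus files is left out here because it needs root and a mounted cpuset hierarchy.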
I would actually *much* prefer this over the status quo. I'm tired of
my crappy, partially-working script that sits there and creates
exactly this configuration (minus the isolcpus part because I actually
want migration to work) on boot. (Actually, it could have two
automatic cgroups: /kernel and /init -- init and UMH would go in init
and kernel threads and such would go in /kernel. Userspace would be
able to request that a different cgroup be used for newly-created
Heck, even systemd would probably prefer this. Then it could cleanly
expose a "slice" or whatever it's called for random kernel shit and at
least you could configure it meaningfully.