Hello, Guix!
This is my second posting to the mailing list but the first using Gnus and smtpmail. If I've formatted anything poorly, don't hesitate to let me know.

I've been spending a silly amount of time trying to get a local flavor of Kubernetes running on Guix System. I wanted to share my experience and also solicit feedback from Guix's developers on how to improve the cgroups implementation so that those who follow me will have an easier time of it.

I wish to start by stating that I am largely a Linux enthusiast. Most of my knowledge of cgroups I owe to reading over the last two weeks. If I state something as true and I've gotten it wrong, please don't hesitate to correct me (kindly). With that, here come the statements as I understand them to be true.

Most flavors of local Kubernetes expect systemd, which presents some unusual challenges for Guix System users, especially when using Podman rootlessly to run a local Kubernetes cluster, which is my use case. As I understand it, systemd creates user "slices", which kind and minikube then map cgroups to.

Patch 64260 added support for cgroups v2, a requirement for Podman to run rootless containers and rootless Kubernetes clusters. However, because we don't use systemd, and therefore have no assigned user slices, our /sys/fs/cgroup looks like this:

    ls -lah /sys/fs/cgroup/
    total 0
    dr-xr-xr-x 7 root root 0 Sep 24 13:09 .
    drwxr-xr-x 8 root root 0 Sep 24 13:09 ..
    drwxr-xr-x 2 root root 0 Sep 24 13:09 c1
    drwxr-xr-x 2 root root 0 Sep 24 13:09 c2
    drwxr-xr-x 2 root root 0 Sep 24 16:26 c3
    drwxr-xr-x 2 root root 0 Sep 24 16:26 c4
    -r--r--r-- 1 root root 0 Sep 24 13:09 cgroup.controllers
    -rw-r--r-- 1 root root 0 Sep 24 18:07 cgroup.max.depth
    -rw-r--r-- 1 root root 0 Sep 24 18:07 cgroup.max.descendants
    -rw-r--r-- 1 root root 0 Sep 24 18:07 cgroup.pressure
    -rw-r--r-- 1 root root 0 Sep 24 13:09 cgroup.procs
    -r--r--r-- 1 root root 0 Sep 24 18:07 cgroup.stat
    -rw-r--r-- 1 root root 0 Sep 24 18:06 cgroup.subtree_control
    -rw-r--r-- 1 root root 0 Sep 24 18:07 cgroup.threads
    -rw-r--r-- 1 root root 0 Sep 24 18:07 cpu.pressure
    -r--r--r-- 1 root root 0 Sep 24 18:07 cpuset.cpus.effective
    -r--r--r-- 1 root root 0 Sep 24 18:07 cpuset.mems.effective
    -r--r--r-- 1 root root 0 Sep 24 18:07 cpu.stat
    dr-xr-xr-x 2 root root 0 Sep 24 13:09 elogind
    -rw-r--r-- 1 root root 0 Sep 24 18:07 io.cost.model
    -rw-r--r-- 1 root root 0 Sep 24 18:07 io.cost.qos
    -rw-r--r-- 1 root root 0 Sep 24 18:07 io.pressure
    -rw-r--r-- 1 root root 0 Sep 24 18:07 io.prio.class
    -r--r--r-- 1 root root 0 Sep 24 18:07 io.stat
    -r--r--r-- 1 root root 0 Sep 24 18:07 memory.numa_stat
    -rw-r--r-- 1 root root 0 Sep 24 18:07 memory.pressure
    --w------- 1 root root 0 Sep 24 18:07 memory.reclaim
    -r--r--r-- 1 root root 0 Sep 24 18:07 memory.stat
    -r--r--r-- 1 root root 0 Sep 24 18:07 misc.capacity

You may notice the first problem, which is that the entire tree is owned by root. kind and minikube don't like this:

    2023-09-23T23:33:41.974998799+02:00 Failed to create /init.scope control group: Permission denied
    2023-09-23T23:33:41.974998799+02:00 Failed to allocate manager object: Permission denied
    2023-09-23T23:33:41.974998799+02:00 [!!!!!!] Failed to allocate manager object.
    2023-09-23T23:33:41.974998799+02:00 Exiting PID 1...: container exited unexpectedly

The second problem is that kind and minikube both expect Delegate=yes to be set, a systemd setting that allows these tools to set cgroup limits. The controllers they expect to manage are cpu, cpuset, memory and pids. We can force these privileges like so (as root, since the file is only writable by root):

    echo "+cpu +cpuset +memory +pids" >> /sys/fs/cgroup/cgroup.subtree_control

To fix the first problem we can run:

    g=users && sudo chgrp -R ${g} /sys/fs/cgroup/
    u=$USER && sudo chown -R ${u}: /sys/fs/cgroup

These aren't harmful actions, since all we're doing is changing the cgroups file tree to be owned by our user and its users group.

Once we've addressed the first and second problems, the rest is relatively easy. We need to make iptables available, including its modules, so the package alone isn't enough: we need Guix's iptables service. We need to set a range of user IDs and group IDs for Podman to use rootlessly. Finally, we need to set a container policy, otherwise Podman can't pull any image from anywhere. All of those can be done from inside our Guix System configuration file.

What I'd really like to see is some method for declaratively changing the ownership of the cgroups file tree and setting limit delegation, since otherwise these actions need to be done on every boot. I don't have the Guile skills to pull this off, but if someone fancied mentoring me, I'd be happy to give it a shot. I have just enough ability to cobble together a kind package from a binary (for shame, I know) and to edit the upstream EXWM package to be based on a newer Emacs release. Otherwise, if there's already a method of declaring these, or someone else can take a crack at this, please let me know!
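Until something declarative exists, the two manual steps can at least be collected into one function to run as root after each boot. This is only a sketch under my own assumptions: the name delegate_cgroups and its three arguments are made up for illustration, and it folds the chgrp and chown into a single chown call with an explicit owner and group.

```shell
#!/bin/sh
# Sketch: redo the manual cgroup setup described above.
# delegate_cgroups and its arguments are hypothetical names, not part
# of Guix, Podman, kind or minikube.
#
#   $1  cgroup root (defaults to /sys/fs/cgroup)
#   $2  user who should own the tree (required)
#   $3  group who should own the tree (defaults to "users")
delegate_cgroups() {
    root="${1:-/sys/fs/cgroup}"
    owner="$2"
    group="${3:-users}"

    if [ -z "$owner" ]; then
        echo "usage: delegate_cgroups [ROOT] OWNER [GROUP]" >&2
        return 2
    fi

    # Enable the controllers kind and minikube expect to manage.
    echo "+cpu +cpuset +memory +pids" >> "$root/cgroup.subtree_control" \
        || return 1

    # Hand the whole tree to the unprivileged user and group.
    chown -R "$owner:$group" "$root" || return 1
}
```

As root, a call would look like "delegate_cgroups /sys/fs/cgroup alice users", where alice is a placeholder for the actual user name. Taking the root directory as an argument also makes it possible to try the function on a scratch directory first.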
Here's what that Guix System configuration looks like:

    ;; Rootless Podman requires the next 4 services
    ;; we're using the iptables service purely to make its resources
    ;; available to minikube and kind
    (service iptables-service-type
             (iptables-configuration
              (ipv4-rules (plain-file "iptables.rules" "*filter
    :INPUT ACCEPT
    :FORWARD ACCEPT
    :OUTPUT ACCEPT
    COMMIT
    "))
              (ipv6-rules (plain-file "ip6tables.rules" "*filter
    :INPUT ACCEPT
    :FORWARD ACCEPT
    :OUTPUT ACCEPT
    COMMIT
    "))))
    (simple-service 'etc-subuid etc-service-type
                    (list `("subuid"
                            ,(plain-file "subuid"
                                         (string-append "root:0:65536\n"
                                                        username ":100000:65536\n")))))
    (simple-service 'etc-subgid etc-service-type
                    (list `("subgid"
                            ,(plain-file "subgid"
                                         (string-append "root:0:65536\n"
                                                        username ":100000:65536\n")))))
    (service pam-limits-service-type
             (list (pam-limits-entry "*" 'both 'nofile 100000)))
    (simple-service 'etc-container-policy etc-service-type
                    (list `("containers/policy.json"
                            ,(plain-file "policy.json"
                                         "{\"default\": [{\"type\": \"insecureAcceptAnything\"}]}"))))
    %my-services
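After reconfiguring, it may help to confirm that the files these services are meant to create actually exist before blaming Podman. The following is a rough sketch, not anything shipped with Guix or Podman: check_rootless_prereqs is a name I made up, and it only looks for a subuid/subgid entry for the given user and a non-empty containers/policy.json under the given etc directory.

```shell
#!/bin/sh
# Sketch: check the files generated by the configuration above.
# check_rootless_prereqs is a hypothetical helper name.
#
#   $1  etc directory (defaults to /etc)
#   $2  user name to look for (defaults to $USER)
check_rootless_prereqs() {
    etc="${1:-/etc}"
    user="${2:-$USER}"
    status=0

    # Podman needs a subordinate ID range for the user in both files.
    for f in subuid subgid; do
        if ! grep -q "^${user}:" "$etc/$f" 2>/dev/null; then
            echo "no entry for $user in $etc/$f" >&2
            status=1
        fi
    done

    # Without a policy file Podman refuses to pull images.
    if [ ! -s "$etc/containers/policy.json" ]; then
        echo "missing or empty $etc/containers/policy.json" >&2
        status=1
    fi

    return "$status"
}
```

Running "check_rootless_prereqs /etc alice" (alice again being a placeholder) prints one line per missing piece and returns non-zero if anything is absent.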