Hi @proxmox,

For some months now we have been experiencing frequent 'hang' situations on our proxmox nodes.
Today, again, such a situation occurred, so we took some time to look at the situation at hand.

It 'started' when we did a

  pct start 1310

This did not return, and the process list showed the following:

  21462 ?  Ss  0:00 /usr/bin/lxc-start -n 1310
  21619 ?  Z   0:00  \_ [lxc-start] <defunct>
  21758 ?  Ss  0:00 [lxc monitor] /var/lib/lxc 1310
  24681 ?  D   0:00  \_ [lxc monitor] /var/lib/lxc 1310

Looking at the wait channel, the namespaces and the stack of 24681, we noticed that it was blocked in

  [<0>] copy_net_ns+0x

After some more searching with

  grep copy_net_ns /proc/[0-9]*/stack

we found two more processes also blocked on copy_net_ns. These were two ionclean processes in other containers. Killing them (with -9) only showed that the restarted ionclean processes immediately blocked on copy_net_ns again.

The system on which proxmox is running has two Intel(R) Xeon(R) E5-2690 v4 CPUs, each with 14 cores and 28 threads. With hyper-threading, proxmox shows this as 56 CPUs, so real concurrency is possible.

The problem looks like a race condition on some resource. But killing (with -9) all the processes that are hanging on copy_net_ns does not make the kernel release the contended resource: even after no remaining process shows copy_net_ns in its stack, starting a new container immediately blocks on copy_net_ns again. So, as far as we know, only a reboot resolves this.

We also played around with ip li set netns on the veth devices, etc., but we could not get the machine out of this situation in any way other than a reboot.

Based on all this we found https://github.com/lxc/lxd/issues/4468, which says this problem should be fixed in kernel 4.17. We run the latest proxmox enterprise updates on this machine, and its kernel is

  PVE 4.15.18-30 (Thu, 15 Nov 2018 13:32:46 +0100)

As the kernel is ubuntu based, would it be possible to start using the ubuntu 18.10 kernel, which is 4.18, to get around this problem?

--
Kind regards,

Stephan Leemburg
IT Functions
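
PS: in case anyone wants to check for the same condition, this is roughly how we inspected the blocked lxc monitor. The PID 24681 is from the listing above (substitute your own); everything was run as root, and the exact invocations are reconstructed from memory:

  # wait channel of the blocked process (shows copy_net_ns here)
  cat /proc/24681/wchan; echo

  # kernel stack of the blocked process
  cat /proc/24681/stack

  # namespaces of the blocked process
  ls -l /proc/24681/ns/

  # find every process currently blocked in copy_net_ns
  grep copy_net_ns /proc/[0-9]*/stack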
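
PPS: the recovery attempts looked roughly like this. The veth device name follows the usual Proxmox vethCTIDiX convention and is only an example; none of this got the machine out of the blocked state:

  # kill a process stuck in copy_net_ns (we did this for all of them)
  kill -9 24681

  # try to move the container's veth device to the init network
  # namespace (netns of PID 1), or remove it entirely
  ip link set veth1310i0 netns 1
  ip link delete veth1310i0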
