Hello,

some of our users encounter a strange issue when using lxc-freeze on a container that uses lxcfs. Sometimes lxc-freeze is unable to freeze a process inside the container that is accessing files in /proc provided by lxcfs. The process(es) in question hang in FUSE's request_wait_answer(), and the associated lxcfs process in futex_wait_queue_me (according to ps faxl).
This is quite surprising, because lxcfs is not part of the cgroup that is frozen and should thus not be affected by a call to lxc-freeze.

A similar, but not surprising, behaviour can be observed when mounting a FUSE file system in the container itself (e.g., create /dev/fuse and mount an sshfs inside the CT), running find in a loop on the mounted FUSE fs in the container, and trying to lxc-freeze the container. In that case, the problem is that the kernel freezer does not know in which order the processes would need to be frozen in order to avoid a deadlock. I don't see how this would apply to lxcfs (running on the host) and a process accessing it (in the container), though.

A test setup that seems to work (but takes a while to trigger):

1) Log into the container and do:

   $ while : ; do uptime; done

2) On the host do:

   $ i=0; while : ; do let i++; echo freeze $i && lxc-freeze -n NAME; echo unfreeze && lxc-unfreeze -n NAME; done

At some point, the output of 2) will stop, and 'ps faxl' will show something like this:

# ps faxl | grep lxcfs
4   0  3774     1 20 0 527956 2132 futex_wait_queue_me Ssl ?     0:10 /usr/bin/lxcfs -f -s -o allow_other /var/lib/lxcfs/
5   0 22927  3774 20 0 380220  788 wait                S   ?     0:00  \_ /usr/bin/lxcfs -f -s -o allow_other /var/lib/lxcfs/
1   0 22928 22927 20 0 380352  788 futex_wait_queue_me S   ?     0:00      \_ /usr/bin/lxcfs -f -s -o allow_other /var/lib/lxcfs/

# ps faxl portion for the container (no lxc-attach was used, so this includes all of it):
5   0 12569     1 20 0  38768 3448 ep_poll             Ss  ?     0:02 [lxc monitor] /var/lib/lxc 104
4   0 12651 12569 20 0  34080 4492 refrigerator        Ds  ?     0:00  \_ /sbin/init
4   0 12815 12651 20 0  30488 5436 refrigerator        Ds  ?     0:00      \_ /usr/lib/systemd/systemd-journald
4  81 12981 12651 20 0  34748 3444 refrigerator        Ds  ?     0:00      \_ /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
4   0 13016 12651 20 0  15292 2424 refrigerator        Ds  ?     0:00      \_ /usr/lib/systemd/systemd-logind
4 193 13033 12651 20 0  19792 2688 refrigerator        Ds  ?     0:00      \_ /usr/lib/systemd/systemd-networkd
4   0 13052 12651 20 0   6348 1664 refrigerator        Ds+ pts/7 0:00      \_ /sbin/agetty --noclear --keep-baud console 115200 38400 9600 vt220
4   0 13055 12651 20 0   6348 1544 refrigerator        Ds+ pts/1 0:00      \_ /sbin/agetty --noclear --keep-baud pts/1 115200 38400 9600 vt220
4   0 13058 12651 20 0  89728 4128 refrigerator        Ds  ?     0:00      \_ login -- root
4   0 30296 13058 20 0  14408 3356 refrigerator        Ds  pts/0 0:01      |   \_ -bash
0   0 22921 30296 20 0  31980 2380 request_wait_answer D+  pts/0 0:00      |       \_ uptime
4   0 30127 12651 20 0  33752 4128 refrigerator        Ds  ?     0:00      \_ /usr/lib/systemd/systemd --user
5   0 30159 30127 20 0  96432 1316 sigtimedwait        S   ?     0:00          \_ (sd-pam)

Attaching gdb to the lxcfs process in question (22928 in this case) gives the following (trimmed) backtrace:

#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1  0x00007f9552b816db in _L_lock_11305 () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f9552b7f838 in __GI___libc_realloc (oldmem=0x7f9552ea8620 <main_arena>, bytes=bytes@entry=567) at malloc.c:3025

See [1] for the full backtrace. It seems that a fork() gone wrong fails an assertion, and the malloc() needed to asprintf the error message waits for a lock?

Calling lxc-unfreeze -n NAME makes both the container and lxcfs continue without problems, and a subsequent lxc-freeze -n NAME works (since it took >6000 freeze attempts to trigger the issue with this setup, this is not surprising).

While it takes a while to reproduce this in the test setting, our users report that it occurs quite often in "real" environments. Some common factors seem to be: running multiple containers, and running some kind of monitoring software that accesses various /proc files in the container (we have reports concerning piwik, splunkd and monit). See [2] for a support forum thread with reports of varying detail, and hopefully more backtraces soon.
Note that Proxmox VE calls lxc-freeze for both snapshot- and suspend-mode backups, so this issue affects both modes.

Thanks in advance for checking this out,
Fabian

1: https://gist.githubusercontent.com/Blub/72a7f432fcf8f6513919/raw/cbc22497abd95746dbb426b0674572c7ffef6a07/lxc-err1.txt
2: https://forum.proxmox.com/threads/lxc-backup-randomly-hangs-at-suspend.25345/

_______________________________________________
lxc-devel mailing list
lxc-devel@lists.linuxcontainers.org
http://lists.linuxcontainers.org/listinfo/lxc-devel