Hi, On Wed, Sep 16, 2020 at 11:14:58PM +0200, Pierre-Elliott Bécue wrote: > Le mercredi 16 septembre 2020 à 22:08:24+0200, Pierre-Elliott Bécue a écrit : > > Le vendredi 11 septembre 2020 à 11:12:25+0200, Iñaki Malerba a écrit : > > > Hi Pebs, > > > > > > Thanks for checking this. > > > > > > On Sat, 5 Sep 2020 23:23:30 +0200 Pierre-Elliott =?utf-8?B?QsOpY3Vl?= > > > <[email protected]> wrote:> > > > > LXC's devs told me that 4.0.4 should solve it. I'm uploading this > > > > release now. Please don't hesitate to tell me if it helps. > > > > > > Run a pipeline removing the pinning of lxc, and the behaviour seems to > > > be the same. > > > > > > Image building: > > > https://salsa.debian.org/ina/pipeline/-/jobs/990332 > > > > Setting up lxc (1:4.0.4-1) .. > > > > > > Running lxc: > > > https://salsa.debian.org/ina/pipeline/-/jobs/990352 > > > > <VirtSubproc>: failure: ['sudo', 'timeout', '600', 'lxc-stop', > > > '--quiet', '--kill', '--name', 'ci-254-b2fcad5f'] failed (exit status 1, > > > stderr '') > > > > > > Please let me know if you want us to test something else. > > > > > > Abrazos, > > > > Could you get me a full trace like the previous time? I have no > > technical means of running proper tests currently, sorry. :/ > > > > Cheers! > > I found a way to run tests on my own. > > Turns out I tried to add a lxc-attach autopkgtest-stable-amd64 -- ps > auxf to see the process tree in case I could find something useful, > and… the container successfully stopped that time. I retried and it kept > working. > > The process tree I see is: > ─( 23:09:35 )─< /home/becue/tmp > >───────────────────────────────────────────────[ 0 ]─ > root@dawaj # docker run --rm --privileged -i autopkgtest > Starting LXC network bridge: :Starting LXC autoboot containers: :USER > PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND > root 4 0.0 0.0 7644 2760 ? R 21:09 0:00 ps auxf > root 1 0.0 0.0 20904 7492 ? Ds 21:09 0:00 /sbin/init > ok > > After some more tests, it seems that lxc-start && lxc-stop isn't working > properly because the signal is sent before the container is ready to > handle it. > > After this test I decided to add a sleep 2 before the lxc-attach ... -- > ps command: > > Starting LXC network bridge: :Starting LXC autoboot containers: :USER > PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND > root 52 0.0 0.0 7644 2768 ? R 21:10 0:00 ps auxf > root 1 2.5 0.1 21524 9596 ? Ss 21:10 0:00 /sbin/init > root 17 0.5 0.1 27444 8404 ? Ss 21:10 0:00 > /lib/systemd/systemd-journald > root 27 0.0 0.0 2348 1772 ? Ss 21:10 0:00 /sbin/ifup > -a --read-environment > root 42 0.0 0.0 2392 764 ? S 21:10 0:00 \_ > /bin/sh -c /sbin/dhclient -4 -v -i -pf /run/dhclient.eth0.pid -lf > /var/lib/dhcp/dhclient.eth0.leases -I -df /var/lib/dhcp/dhclient6.eth0.leases > eth0 . > root 43 0.0 0.0 8456 1936 ? S 21:10 0:00 \_ > /sbin/dhclient -4 -v -i -pf /run/dhclient.eth0.pid -lf > /var/lib/dhcp/dhclient.eth0.leases -I -df /var/lib/dhcp/dhclient6.eth0.leases > eth0 > root 44 0.0 0.0 9492 5644 ? S 21:10 0:00 \_ > /sbin/dhclient -4 -v -i -pf /run/dhclient.eth0.pid -lf > /var/lib/dhcp/dhclient.eth0.leases -I -df /var/lib/dhcp/dhclient6.eth0.leases > eth0 > message+ 50 0.0 0.0 8696 3636 ? Ss 21:10 0:00 > /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile > --systemd-activation --syslog-only > root 51 0.5 0.0 19308 6376 ? Ss 21:10 0:00 > /lib/systemd/systemd-logind > ok > > Turns out your lxc-stop is really fast, and therefore, not catched > properly by LXC. > > While I appreciate it shouldn't be a corner case that makes things > explode, do you think there's a way to take this realization into > account to lower the severity of this bug, having a temporary fix set up > in place? > > I'll still try to see what upstream could offer to handle this in a > better way.
I did a little bit of debugging on this today. I think the provided
Dockerfile does not really reproduce the real issue. The lxc-stop call
from lxc comes after the tests are run, so plenty of time after the
lxc-start call.
I installed autopkgtest inside the container, did manually the mount
steps listed in the Dockerfile, and tried it directly. Failed as
expected:
root@665d38b2f9e3:~/pkg# autopkgtest -B . -- lxc autopkgtest-stable-amd64
autopkgtest [12:41:01]: starting date: 2020-09-17
autopkgtest [12:41:01]: version 5.14
autopkgtest [12:41:01]: host 665d38b2f9e3; command line:
/usr/bin/autopkgtest -B . -- lxc autopkgtest-stable-amd64
autopkgtest [12:41:07]: testbed dpkg architecture: amd64
autopkgtest [12:41:08]: testbed running kernel: Linux 5.8.0-1-amd64 #1 SMP
Debian 5.8.7-1 (2020-09-05)
autopkgtest [12:41:08]: @@@@@@@@@@@@@@@@@@@@ built-tree .
autopkgtest [12:41:08]: test command1: preparing testbed
autopkgtest [12:41:08]: test command1: true
autopkgtest [12:41:08]: test command1: [-----------------------
autopkgtest [12:41:09]: test command1: -----------------------]
autopkgtest [12:41:09]: test command1: - - - - - - - - - - results - - - -
- - - - - -
command1 PASS
autopkgtest [12:41:09]: @@@@@@@@@@@@@@@@@@@@ summary
command1 PASS
<VirtSubproc>: failure: ['lxc-stop', '--quiet', '--kill', '--name',
'autopkgtest-lxc-jodrho'] failed (exit status 1, stderr '')
autopkgtest [12:41:09]: ERROR: autopkgtest
: error cleaning up:
autopkgtest [12:41:09]: ERROR: testbed failure: cannot send to testbed: [Errno
32] Broken pipe
Note however the despite lxc-stop call returning non-zero, the container
is stopped. I then tried starting the container again, and stopping it
by hand. Failed as well:
root@665d38b2f9e3:~/pkg# lxc-start --name autopkgtest-lxc-jodrho
root@665d38b2f9e3:~/pkg# lxc-attach --name autopkgtest-lxc-jodrho -- sh -c
'sleep 1 && echo HI'
HI
root@665d38b2f9e3:~/pkg# lxc-stop --kill --name autopkgtest-lxc-jodrho
lxc-stop: autopkgtest-lxc-jodrho: commands.c: lxc_cmd_stop: 775 No such
file or directory - Failed to stop container "autopkgtest-lxc-jodrho"
Note however that this time I omitted the --quiet flag that autopkgtest
uses, which gives us at least some extra info. No such file of
directory.
I tried lxc-stop with the --kill flag:
root@665d38b2f9e3:~/pkg# lxc-start --name autopkgtest-lxc-jodrho
root@665d38b2f9e3:~/pkg# lxc-attach --name autopkgtest-lxc-jodrho -- sh -c
'sleep 1 && echo HI'
HI
root@665d38b2f9e3:~/pkg# lxc-stop --name autopkgtest-lxc-jodrho
Works!
So this seems related to the --kill flag.
I then tried to strace the lxc-stop --kill call. By the end of the log,
there is this:
openat(AT_FDCWD, "/lxc/autopkgtest-lxc-jodrho/partial", O_RDWR|O_CLOEXEC) =
-1 ENOENT (No such file or directory)
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX,
sun_path=@"/lxc/autopkgtest-lxc-jodrho/command"}, 38) = 0
getpid() = 3567
getuid() = 0
getgid() = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0,
msg_iov=[{iov_base="\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}],
msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET,
cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3567, uid=0, gid=0}}],
msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
recvmsg(3, {msg_name=NULL, msg_namelen=0,
msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\232\r\0\0\0\0\0\0", iov_len=16}],
msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC)
= 16
close(3) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX,
sun_path=@"/lxc/autopkgtest-lxc-jodrho/command"}, 38) = 0
getpid() = 3567
getuid() = 0
getgid() = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0,
msg_iov=[{iov_base="\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}],
msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET,
cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3567, uid=0, gid=0}}],
msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
recvmsg(3, {msg_name=NULL, msg_namelen=0,
msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0", iov_len=16}],
msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC)
= 16
close(3) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX,
sun_path=@"/lxc/autopkgtest-lxc-jodrho/command"}, 38) = 0
getpid() = 3567
getuid() = 0
getgid() = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0,
msg_iov=[{iov_base="\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}],
msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET,
cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3567, uid=0, gid=0}}],
msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
recvmsg(3, {msg_name=NULL, msg_namelen=0,
msg_iov=[{iov_base="\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}],
msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC)
= 16
close(3) = 0
write(2, "lxc-stop: autopkgtest-lxc-jodrho"..., 34lxc-stop:
autopkgtest-lxc-jodrho: ) = 34
write(2, "commands.c: lxc_cmd_stop: 775 ", 30commands.c: lxc_cmd_stop: 775
) = 30
write(2, "No such file or directory - Fail"..., 77No such file or directory
- Failed to stop container "autopkgtest-lxc-jodrho") = 77
write(2, "\n", 1
) = 1
exit_group(1) = ?
+++ exited with 1 +++
First I though that it could be related to /lxc/autopkgtest-lxc-jodrho/partial,
but strace'ing a lxc-stop call without --kill, there is a similar error, so
that's not it. For comparison, the end of the logs in that case look like this:
openat(AT_FDCWD, "/lxc/test/partial", O_RDWR|O_CLOEXEC) = -1 ENOENT (No
such file or directory)
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0
getpid() = 3926
getuid() = 0
getgid() = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0,
msg_iov=[{iov_base="\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}],
msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET,
cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}],
msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
recvmsg(3, {msg_name=NULL, msg_namelen=0,
msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\f\17\0\0\0\0\0\0", iov_len=16}],
msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC)
= 16
close(3) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0
getpid() = 3926
getuid() = 0
getgid() = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0,
msg_iov=[{iov_base="\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}],
msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET,
cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}],
msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
recvmsg(3, {msg_name=NULL, msg_namelen=0,
msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0", iov_len=16}],
msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC)
= 16
close(3) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0
getpid() = 3926
getuid() = 0
getgid() = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0,
msg_iov=[{iov_base="\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}],
msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET,
cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}],
msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
recvmsg(3, {msg_name=NULL, msg_namelen=0,
msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0", iov_len=16}],
msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC)
= 16
close(3) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0
getpid() = 3926
getuid() = 0
getgid() = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0,
msg_iov=[{iov_base="\22\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}],
msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET,
cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}],
msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
recvmsg(3, {msg_name=NULL, msg_namelen=0,
msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}],
msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET,
cmsg_type=SCM_RIGHTS, cmsg_data=[4]}], msg_controllen=24,
msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
close(3) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0
getpid() = 3926
getuid() = 0
getgid() = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0,
msg_iov=[{iov_base="\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}],
msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET,
cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}],
msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
recvmsg(3, {msg_name=NULL, msg_namelen=0,
msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\f\17\0\0\0\0\0\0", iov_len=16}],
msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC)
= 16
close(3) = 0
openat(AT_FDCWD, "/proc/3852/status", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
read(3, "Name:\tsystemd\nUmask:\t0000\nState:"..., 1024) = 1024
close(3) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0
getpid() = 3926
getuid() = 0
getgid() = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\n\0\0\0
\0\0\0\320\210s\352\376\177\0\0", iov_len=16}], msg_iovlen=1,
msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS,
cmsg_data={pid=3926, uid=0, gid=0}}], msg_controllen=32, msg_flags=0},
MSG_NOSIGNAL) = 16
sendto(3,
"\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32,
MSG_NOSIGNAL, NULL, 0) = 32
recvmsg(3, {msg_name=NULL, msg_namelen=0,
msg_iov=[{iov_base="\10\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0", iov_len=16}],
msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC)
= 16
pidfd_send_signal(4, SIGRT_5, NULL, 0) = 0
poll([{fd=4, events=POLLIN}], 1, 60000) = 1 ([{fd=4, revents=POLLIN}])
setsockopt(3, SOL_SOCKET, SO_RCVTIMEO_OLD,
"<\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
recvfrom(3, "", 264, 0, NULL, NULL) = 0
close(3) = 0
close(4) = 0
exit_group(0) = ?
+++ exited with 0 +++
I just ran out of time. Maybe the above is useful for someone else to
figure the issue out.
signature.asc
Description: PGP signature

