On 01.04.2014 01:49, Dwight Engen wrote:
> On Mon, 31 Mar 2014 23:18:13 +0200
> Florian Klink <[email protected]> wrote:
>
>> On 31.03.2014 21:13, Dwight Engen wrote:
>>> On Mon, 31 Mar 2014 20:34:15 +0200
>>> Florian Klink <[email protected]> wrote:
>>>
>>>> On 31.03.2014 20:10, Dwight Engen wrote:
>>>>> On Sat, 29 Mar 2014 23:39:33 +0100
>>>>> Florian Klink <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> when running multiple lxc actions in a row using the command line
>>>>>> tools, I sometimes observe the following state:
>>>>>>
>>>>>> - lxc-monitord is not running anymore
>>>>>> - /run/lxc/var/lib/lxc/monitor-fifo still exists, but is
>>>>>>   "refusing connection"
>>>>>>
>>>>>> In the logs, I then see the following:
>>>>>>
>>>>>> lxc-start 1395671045.703 ERROR lxc_monitor - connect : backing off 10
>>>>>> lxc-start 1395671045.713 ERROR lxc_monitor - connect : backing off 50
>>>>>> lxc-start 1395671045.763 ERROR lxc_monitor - connect : backing off 100
>>>>>> lxc-start 1395671045.864 ERROR lxc_monitor - connect : Connection refused
>>>>>>
>>>>>> ... and the command fails.
>>>>>
>>>>> The only time I've seen this happen is if lxc-monitord is hard
>>>>> killed so it doesn't have a chance to clean up and remove the
>>>>> socket.
>>>>
>>>> Here, it's happening quite frequently. However, the script never
>>>> kills lxc-monitord on its own, it just tries to detect and fix
>>>> this state by removing the socket file...
>>>
>>> Right, removing the socket file makes it so another lxc-monitord
>>> will start, but the question is why is the first one exiting without
>>> cleaning up? Can you reliably reproduce it at will? If so then maybe
>>> you could attach an strace to lxc-monitord and see why it is
>>> exiting.
>>
>> I was so far not successful in reproducing the bug while having an
>> strace running. :-( But I'll continue to try!

Success :-) I managed to get an strace while trying to reproduce the
bug. I gzipped and attached it to this mail.

It's the output of

strace -f -s 200 /usr/lib/lxc/lxc-monitord /var/lib/lxc /run/lxc/var/lib/lxc/monitor-fifo &> strace_output.txt

I fired a bunch of lxc-starts and lxc-stops in a row, then stopped my
script and waited for lxc-monitord (and strace too) to stop. Then I
started my script again and had the "leftover monitor-fifo state".

>>>
>>>>>
>>>>>>
>>>>>> A possible workaround would be checking for non-running
>>>>>> lxc-monitord process but existing monitor-fifo file then removing
>>>>>> the fifo if it exists before running the next lxc command, but
>>>>>> that's ugly ;-)
>>>>>
>>>>> Is there a good non-racy way to do this? I guess monitord could
>>>>> write its pid in $LXCPATH and we could kill(pid, 0) it.
>>
>> I also think that lxc should be able to recover from this problem
>> automatically.
>
> I agree, though I would like to understand the root cause. Can you try
> out the attached patch? I think it will cure your issues.
>

Thanks for the patch! Just tell me if you need more information for the
strace above. If not, I'll happily apply the patch :-)

>>>>>
>>>>>> Is this behaviour known? Is there some missing "cleanup code" in
>>>>>> lxc(_monitord) or why is it failing like this?
>>>>>
>>>>> Currently it catches SIGILL, SIGSEGV, SIGBUS, and SIGTERM and
>>>>> cleans up. Other than hard kill I'm not sure what else might
>>>>> cause it to exit without cleaning up.
>>>>
>>>> I shut down containers with `lxc-stop -n container-name`
>>>> (lxc.stopsignal=30 (SIGPWR)), however this signal should never go
>>>> to lxc_monitord, right?
>>>
>>> Right, that goes to the init process of the container.
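
For illustration only, the pid-file idea quoted above (monitord writes
its pid, the caller probes it with kill(pid, 0)) could be sketched in
Python roughly as follows. This is a sketch under assumptions, not what
lxc does: the pid file is only proposed in this thread, and the helper
names pid_is_alive and remove_stale_fifo as well as both path arguments
are hypothetical.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Probe a pid with signal 0, which delivers nothing but checks existence."""
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        # ESRCH: no process with this pid.
        return False
    except PermissionError:
        # EPERM: the process exists but belongs to another user.
        return True
    return True


def remove_stale_fifo(pidfile: str, fifo: str) -> bool:
    """Remove `fifo` when the pid recorded in `pidfile` is not running.

    Both paths are hypothetical; the caller decides where they live.
    Returns True only if the fifo was actually removed.
    """
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        # Missing or unreadable pid file: treat the monitor as not running.
        pid = -1
    if pid_is_alive(pid):
        return False
    try:
        os.unlink(fifo)
    except FileNotFoundError:
        return False
    return True
```

Note that this does not answer the "non-racy" question: a new monitord
could start between the liveness check and the unlink, and a recycled
pid could make a dead monitor look alive.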
strace_output.txt.gz
Description: application/gzip
_______________________________________________
lxc-users mailing list
[email protected]
http://lists.linuxcontainers.org/listinfo/lxc-users
