On Tue, 01 Apr 2014 22:15:25 +0200 Florian Klink <flo...@flokli.de> wrote:
> On 01.04.2014 01:49, Dwight Engen wrote:
> > On Mon, 31 Mar 2014 23:18:13 +0200
> > Florian Klink <flo...@flokli.de> wrote:
> >
> >> On 31.03.2014 21:13, Dwight Engen wrote:
> >>> On Mon, 31 Mar 2014 20:34:15 +0200
> >>> Florian Klink <flo...@flokli.de> wrote:
> >>>
> >>>> On 31.03.2014 20:10, Dwight Engen wrote:
> >>>>> On Sat, 29 Mar 2014 23:39:33 +0100
> >>>>> Florian Klink <flo...@flokli.de> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> when running multiple lxc actions in a row using the command line
> >>>>>> tools, I sometimes observe the following state:
> >>>>>>
> >>>>>>
> >>>>>> - lxc-monitord is not running anymore
> >>>>>> - /run/lxc/var/lib/lxc/monitor-fifo still exists, but is
> >>>>>>   "refusing connection"
> >>>>>>
> >>>>>> In the logs, I then see the following:
> >>>>>>
> >>>>>>
> >>>>>> lxc-start 1395671045.703 ERROR lxc_monitor - connect : backing off 10
> >>>>>> lxc-start 1395671045.713 ERROR lxc_monitor - connect : backing off 50
> >>>>>> lxc-start 1395671045.763 ERROR lxc_monitor - connect : backing off 100
> >>>>>> lxc-start 1395671045.864 ERROR lxc_monitor - connect : Connection refused
> >>>>>>
> >>>>>>
> >>>>>> ... and the command fails.
> >>>>>
> >>>>> The only time I've seen this happen is if lxc-monitord is hard
> >>>>> killed so it doesn't have a chance to clean up and remove the
> >>>>> socket.
> >>>>
> >>>> Here, it's happening quite frequently. However, the script never
> >>>> kills lxc-monitord on its own, it just tries to detect and fix
> >>>> this state by removing the socket file...
> >>>
> >>> Right, removing the socket file makes it so another lxc-monitord
> >>> will start, but the question is why is the first one exiting
> >>> without cleaning up? Can you reliably reproduce it at will? If so
> >>> then maybe you could attach an strace to lxc-monitord and see why
> >>> it is exiting.
> >>
> >> I was so far not successful in reproducing the bug while having an
> >> strace running. :-( But I'll continue to try!
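Side note on the log excerpt quoted above: the "backing off 10 / 50 / 100"
lines followed by "Connection refused" look like a client that retries
connect() a few times with growing delays and then gives up. Here is a
minimal sketch of that pattern. It is only an illustration, not the code in
lxc: the function name is made up, the delays are read off the log and
assumed to be milliseconds (the timestamps suggest so), and it uses a
path-based Unix socket purely for demonstration.

```python
import socket
import time

# Delays in milliseconds, copied from the "backing off" log lines.
# Assumption: this is NOT taken from the lxc source.
BACKOFF_MS = (10, 50, 100)


def connect_with_backoff(path, delays_ms=BACKOFF_MS, sleep=time.sleep):
    """Connect to the Unix socket at `path`, retrying on ECONNREFUSED.

    Hypothetical helper: sleeps delays_ms[i] after the i-th refused
    attempt and re-raises ConnectionRefusedError after the final one.
    """
    for attempt in range(len(delays_ms) + 1):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            return sock
        except ConnectionRefusedError:
            sock.close()
            if attempt == len(delays_ms):
                raise  # out of retries: this is the failure the user sees
            sleep(delays_ms[attempt] / 1000.0)
```

With the default delays that is four connect attempts in total, which matches
the four log lines. A socket file that nobody is listening on anymore fails
every attempt, so the retries only help when the daemon is still starting up.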
>
> Success :-) I managed to get an strace while trying to reproduce the
> bug. I gzipped and attached it to this mail.
>
> It's the output of strace -f -s 200 /usr/lib/lxc/lxc-monitord
> /var/lib/lxc /run/lxc/var/lib/lxc/monitor-fifo &> strace_output.txt
>
> I fired a bunch of lxc-starts and lxc-stops in a row, then stopped my
> script and waited for lxc-monitord (and strace too) to stop.
>
> Then I started my script again and had the "leftover monitor-fifo
> state".

Unfortunately, I don't think that strace shows the problem. It looks to
me like a normal exit with a successful

    unlink("/run/lxc//var/lib/lxc/monitor-fifo") = 0

right near the end. You can't really run monitord by hand like that,
since it is expecting a pipe fd as argv[2]. That's why I was suggesting
attaching to it. So something like:

    lxc-start <your ct>
    lxc-monitor -n '.*'

in another terminal:

    ps aux |grep monitord    -> find the pid of lxc-monitord
    strace -v -t -o straceout.txt -p <pid of monitord>

and then do whatever you do to make things fail :)

> >>>
> >>>>>
> >>>>>>
> >>>>>> A possible workaround would be checking for non-running
> >>>>>> lxc-monitord process but existing monitor-fifo file then
> >>>>>> removing the fifo if it exists before running the next lxc
> >>>>>> command, but that's ugly ;-)
> >>>>>
> >>>>> Is there a good non-racy way to do this? I guess monitord could
> >>>>> write its pid in $LXCPATH and we could kill(pid, 0) it.
> >>
> >> I also think that lxc should be able to recover from this problem
> >> automatically.
> >
> > I agree, though I would like to understand the root cause. Can you
> > try out the attached patch? I think it will cure your issues.
> >
>
> Thanks for the patch! Just tell me if you need more information for
> the strace above. If not, I'll happily apply the patch :-)

You can try the patch to see if it solves your issue, though I'd still
like to understand why it's happening in the first place.
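As an aside, the kill(pid, 0) idea quoted above could look roughly like the
sketch below. This is not the attached patch, just an illustration of the
check: the pid file and both helper names are hypothetical (monitord does not
write a pid file today), and it is still racy, since the pid could be reused
or monitord could start between the check and the unlink.

```python
import os


def pid_alive(pid):
    """Return True if a process with this pid exists.

    Signal 0 delivers nothing; the kernel only performs the existence
    and permission checks.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:  # ESRCH: no such process
        return False
    except PermissionError:     # EPERM: process exists, owned by another user
        return True
    return True


def remove_stale_fifo(pidfile, fifo):
    """Unlink `fifo` unless the pid recorded in `pidfile` is still alive.

    Hypothetical helper. Returns True if the fifo was removed.
    """
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        pid = None  # no usable pid recorded: treat the owner as gone
    if pid is not None and pid > 0 and pid_alive(pid):
        return False  # owner still running, leave the fifo alone
    try:
        os.unlink(fifo)
    except FileNotFoundError:
        return False  # nothing to clean up
    return True
```

A wrapper script would call remove_stale_fifo() before each lxc command, which
is the "ugly" workaround from the quoted mail with the pid check added.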
I may rework the patch based on Serge's suggestion, but it'd be nice to
know if the one I sent does fix what you are seeing. It worked for all
the hard-kill cases I tried.

> >>>>>
> >>>>>> Is this behaviour known? Is there some missing "cleanup code"
> >>>>>> in lxc(_monitord) or why is it failing like this?
> >>>>>
> >>>>> Currently it catches SIGILL, SIGSEGV, SIGBUS, and SIGTERM and
> >>>>> cleans up. Other than hard kill I'm not sure what else might
> >>>>> cause it to exit without cleaning up.
> >>>>
> >>>> I shutdown containers with `lxc-stop -n container-name`
> >>>> (lxc.stopsignal=30 (SIGPWR)), however this signal should never go
> >>>> to lxc_monitord, right?
> >>>
> >>> Right, that goes to the init process of the container.
>
_______________________________________________
lxc-users mailing list
lxc-users@lists.linuxcontainers.org
http://lists.linuxcontainers.org/listinfo/lxc-users