On Fri, 04 Apr 2014 22:22:05 +0200 Florian Klink <[email protected]> wrote:
> On 02.04.2014 16:42, Dwight Engen wrote:
> > On Tue, 01 Apr 2014 22:15:25 +0200
> > Florian Klink <[email protected]> wrote:
> >
> >> On 01.04.2014 01:49, Dwight Engen wrote:
> >>> On Mon, 31 Mar 2014 23:18:13 +0200
> >>> Florian Klink <[email protected]> wrote:
> >>>
> >>>> On 31.03.2014 21:13, Dwight Engen wrote:
> >>>>> On Mon, 31 Mar 2014 20:34:15 +0200
> >>>>> Florian Klink <[email protected]> wrote:
> >>>>>
> >>>>>> On 31.03.2014 20:10, Dwight Engen wrote:
> >>>>>>> On Sat, 29 Mar 2014 23:39:33 +0100
> >>>>>>> Florian Klink <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> when running multiple lxc actions in a row using the command
> >>>>>>>> line tools, I sometimes observe the following state:
> >>>>>>>>
> >>>>>>>> - lxc-monitord is not running anymore
> >>>>>>>> - /run/lxc/var/lib/lxc/monitor-fifo still exists, but is
> >>>>>>>>   "refusing connection"
> >>>>>>>>
> >>>>>>>> In the logs, I then see the following:
> >>>>>>>>
> >>>>>>>> lxc-start 1395671045.703 ERROR lxc_monitor - connect : backing off 10
> >>>>>>>> lxc-start 1395671045.713 ERROR lxc_monitor - connect : backing off 50
> >>>>>>>> lxc-start 1395671045.763 ERROR lxc_monitor - connect : backing off 100
> >>>>>>>> lxc-start 1395671045.864 ERROR lxc_monitor - connect : Connection refused
> >>>>>>>>
> >>>>>>>> ... and the command fails.
> >>>>>>>
> >>>>>>> The only time I've seen this happen is if lxc-monitord is hard
> >>>>>>> killed so it doesn't have a chance to clean up and remove the
> >>>>>>> socket.
> >>>>>>
> >>>>>> Here, it's happening quite frequently. However, the script
> >>>>>> never kills lxc-monitord on its own, it just tries to detect
> >>>>>> and fix this state by removing the socket file...
> >>>>>
> >>>>> Right, removing the socket file makes it so another lxc-monitord
> >>>>> will start, but the question is why is the first one exiting
> >>>>> without cleaning up?
> >>>>> Can you reliably reproduce it at will? If
> >>>>> so then maybe you could attach an strace to lxc-monitord and
> >>>>> see why it is exiting.
> >>>>
> >>>> I was so far not successful in reproducing the bug while having
> >>>> an strace running. :-( But I'll continue to try!
> >>
> >> Success :-) I managed to get an strace while trying to reproduce
> >> the bug. I gzipped and attached it to this mail.
> >>
> >> It's the output of strace -f -s 200 /usr/lib/lxc/lxc-monitord
> >> /var/lib/lxc /run/lxc/var/lib/lxc/monitor-fifo &> strace_output.txt
> >>
> >> I fired a bunch of lxc-starts and lxc-stops in a row, then stopped my
> >> script and waited for lxc-monitord (and strace too) to stop.
> >>
> >> Then I started my script again and had the "leftover monitor-fifo
> >> state".
> >
> > Unfortunately, I don't think that strace shows the problem. It
> > looks to me like a normal exit with a successful
> > unlink("/run/lxc//var/lib/lxc/monitor-fifo") = 0 right near the end.
> >
> > You can't really run monitord by hand like that since it is
> > expecting a pipe fd as argv[2]. That's why I was suggesting
> > attaching to it. So something like:
> >
> > lxc-start <your ct>
> > lxc-monitor -n '.*'
> >
> > in another terminal:
> > ps aux |grep monitord -> find the pid of lxc-monitord
> > strace -v -t -o straceout.txt -p <pid of monitord>
> >
> > and then do whatever you do to make things fail :)
>
> I was not able to get an strace of the bug. I think it is only
> triggered by a lot of lxc-monitord start/stop traffic ;-)
>
> >>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> A possible workaround would be checking for non-running
> >>>>>>>> lxc-monitord process but existing monitor-fifo file then
> >>>>>>>> removing the fifo if it exists before running the next lxc
> >>>>>>>> command, but that's ugly ;-)
> >>>>>>>
> >>>>>>> Is there a good non-racy way to do this? I guess monitord
> >>>>>>> could write its pid in $LXCPATH and we could kill(pid, 0) it.
> >>>>
> >>>> I also think that lxc should be able to recover from this problem
> >>>> automatically.
> >>>
> >>> I agree, though I would like to understand the root cause. Can you
> >>> try out the attached patch? I think it will cure your issues.
> >>>
> >>
> >> Thanks for the patch! Just tell me if you need more information for
> >> the strace above. If not, I'll happily apply the patch :-)
> >
> > You can try the patch to see if it solves your issue, though I'd
> > still like to understand why it's happening in the first place. I
> > may rework the patch based on Serge's suggestion, but it'd be nice
> > to know if the one I sent does fix what you are seeing. It worked
> > for all the hard-kill cases I tried.
>
> Both patches, the pidfile version and the reworked version, fixed my
> problem. So I'm very happy with it :-)
>
> Will this patch also go to the stable-1.0 branch?
> I'd really like to see this fixed in the 1.0.3 release ;-)

Looks like Stéphane did pull it onto stable so you should be good.
Thanks for trying to debug/strace it. I still don't know why this is
happening in the first place but at least this should work around the
problem when it does happen.

> Florian

_______________________________________________
lxc-users mailing list
[email protected]
http://lists.linuxcontainers.org/listinfo/lxc-users
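
For readers following the thread, the stale-socket check discussed above
(socket file still present, no lxc-monitord alive, probed with
kill(pid, 0)) can be sketched roughly as below. This is only an
illustration in Python, not the patch that was applied to lxc, which is
written in C. The function names and the idea of passing the socket and
pidfile paths as arguments are assumptions made for the example; the
thread does not say where the pidfile lives or what it is called.

```python
import os


def pid_alive(pid: int) -> bool:
    """Probe a pid with signal 0: nothing is delivered, only existence is checked."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one process
    try:
        os.kill(pid, 0)
    except (ProcessLookupError, OverflowError):
        return False  # ESRCH (no such process) or a value too large to be a pid
    except PermissionError:
        return True   # EPERM: the process exists but belongs to another user
    return True


def monitor_is_stale(sock_path: str, pidfile_path: str) -> bool:
    """Hypothetical helper: True if the socket file is left over from a dead monitor."""
    if not os.path.exists(sock_path):
        return False  # nothing left over, a new monitor can be started normally
    try:
        with open(pidfile_path) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return True   # socket without a usable pidfile: treat it as stale
    return not pid_alive(pid)
```

A caller would unlink the socket when monitor_is_stale() returns True and
then run the next lxc command. Note that this does not fully answer the
"non-racy" question raised above: the pid could be reused by an
unrelated process between the monitor dying and the check running.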
