On Friday 14 September 2007 21:56, Martin Simmons wrote:
> >>>>> On Thu, 13 Sep 2007 18:52:10 +0200, Kern Sibbald said:
> >
> > On Thursday 13 September 2007 15:40, Martin Simmons wrote:
> > > >>>>> On Thu, 13 Sep 2007 14:14:39 +0200, Kern Sibbald said:
> > > >
> > > > On Thursday 13 September 2007 13:37, Martin Simmons wrote:
> > > > > >>>>> On Wed, 12 Sep 2007 23:09:17 +0200, Kern Sibbald said:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I believe that I may have a fix for the FD crash shown below. 
> > > > > > I'll commit it tomorrow as I have another fix for bug #920 that I
> > > > > > need to fully test before committing.
> > > > >
> > > > > Is there a danger that it can terminate without deleting the pid
> > > > > etc now? It may be better to call pthread_exit instead of exit when
> > > > > already_here is true.
> > > >
> > > > I'm hoping that the true fix is the reordering of some of the
> > > > termination calls, although the dump that Dan produced clearly
> > > > indicated the routine was possibly called twice (the 05a5a ...
> > > > argument to a call indicates reuse of a freed Bacula buffer).
> > >
> > > Yes, I think it happens like this:
> > >
> > > 1) The main thread is idle inside bnet_thread_server().
> > >
> > > 2) The watchdog thread gets a SIGTERM (presumably delivered from
> > > outside).
> > >
> > > 3) The signal handler in the watchdog thread calls terminate_filed as
> > >    expected, which calls bnet_stop_thread_server, which sets quit to
> > > true and calls pthread_kill on the main thread.
> > >
> > > 4) The watchdog thread gets preempted.
> > >
> > > 5) The main thread wakes up due to the signal and returns from
> > >    bnet_thread_server because quit is true.  It then calls
> > > terminate_filed, which frees the resources pointed to by the variable
> > > "me".
> > >
> > > 6) The watchdog thread runs again and passes me->FDaddrs from the freed
> > > "me" to get_first_port_host_order, which causes the crash.
> > >
> > > In this situation, I think write_state_file and delete_pid_file will
> > > never be called.
> >
> > Nice analysis of the problem.
> >
> > I'm reluctant to use pthread_exit() in place of the exit(1) because the
> > pthread_exit() does a lot of things, which could potentially put us into
> > a recursive loop condition.  However, I have added a 2 second sleep
> > before the exit() so that any second call *should* yield to the first
> > call allowing it to cleanup or at least get to deleting the pid file.
>
> Ah, neat idea.

Thanks :-)   And thanks for pointing out the problem ...

>
> __Martin
>
> > Regards,
> >
> > Kern
> >
> > > __Martin
> > >
> > > > Concerning the use of exit(1): my view is that if we get into that
> > > > routine twice, it is best to get out fast.  None of the exit routines
> > > > are critical -- yes not deleting the pid file could be annoying, but
> > > > normally if we were already in the file, it should be deleted.
> > > >
> > > > > Another minor issue is that lock-less use of variables like
> > > > > already_here is not safe on CPUs with speculative execution.  It is
> > > > > unlikely to happen in this code, but you never know.
> > > >
> > > > Yes, it is possible that we could pass through the routine twice
> > > > because the variable is not set inside a mutex.  However, we are
> > > > trying to exit, and it will be set at some point.  The important
> > > > point of the test is not so much to prevent  calling the routine
> > > > twice, which should not really cause any harm, but prevent it from
> > > > getting into some seg fault infinite loop.   IMO, any additional use
> > > > of mutexes is dangerous (could cause a deadlock) and overkill when
> > > > our main goal at that point is to do a minimal cleanup and exit.
> > > >
> > > > I'm willing to see a patch that could simplify things even further,
> > > > or would improve the chances that the pid file is really deleted, but
> > > > would be *very* reluctant to accept anything that has additional
> > > > locking (fear of a deadlock).
> > > >
> > > > > __Martin
> > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Kern
> > > > > >
> > > > > > On Wednesday 12 September 2007 21:15, Martin Simmons wrote:
> > > > > > > >>>>> On Wed, 12 Sep 2007 12:00:59 -0400, Dan Langille said:
> > > > > > > >
> > > > > > > > Priority: normal
> > > > > > > > Content-description: Mail message body
> > > > > > > >
> > > > > > > > After I encounterd repeated cores of bacula-sd, Eric
> > > > > > > > Bollengier suggested this patch.  It passed my first set of
> > > > > > > > regression tests without producing a bacula-sd.core
> > > > > > > >
> > > > > > > > The traceback appears at the end of this email.  Any
> > > > > > > > objections to a commit now?
> > > > > > > >
> > > > > > > > $ svn di
> > > > > > > > Index: stored/stored.c
> > > > > > > > =============================================================
> > > > > > > >==== == --- stored/stored.c     (revision 5534)
> > > > > > > > +++ stored/stored.c     (working copy)
> > > > > > > > @@ -600,10 +600,10 @@
> > > > > > > >     if (debug_level > 10) {
> > > > > > > >        print_memory_pool_stats();
> > > > > > > >     }
> > > > > > > > -   term_reservations_lock();
> > > > > > > >     term_msg();
> > > > > > > >     cleanup_crypto();
> > > > > > > >     free_volume_list();
> > > > > > > > +   term_reservations_lock();
> > > > > > > >     close_memory_pool();
> > > > > > > >
> > > > > > > >     sm_dump(false);                    /* dump orphaned
> > > > > > > > buffers */ $
> > > > > > >
> > > > > > > Looks right, but that won't fix the bacula-fd crash you
> > > > > > > included below
> > > > > > >
> > > > > > > :-)
> > > > > > >
> > > > > > > __Martin
> > > > > > >
> > > > > > > > And FYI:
> > > > > > > >
> > > > > > > > [EMAIL PROTECTED]:~/src/BaculaRegressionTesting] $ gdb 
> > > > > > > > bin/bacula-fd
> > > > > > > > bacula- fd.core
> > > > > > > > GNU gdb 6.1.1 [FreeBSD]
> > > > > > > > Copyright 2004 Free Software Foundation, Inc.
> > > > > > > > GDB is free software, covered by the GNU General Public
> > > > > > > > License, and you are
> > > > > > > > welcome to change it and/or distribute copies of it under
> > > > > > > > certain conditions.
> > > > > > > > Type "show copying" to see the conditions.
> > > > > > > > There is absolutely no warranty for GDB.  Type "show
> > > > > > > > warranty" for details.
> > > > > > > > This GDB was configured as "i386-marcel-freebsd"...
> > > > > > > > Core was generated by `bacula-fd'.
> > > > > > > > Program terminated with signal 11, Segmentation fault.
> > > > > > > > Reading symbols from /lib/libz.so.4...done.
> > > > > > > > Loaded symbols for /lib/libz.so.4
> > > > > > > > Reading symbols from /lib/libthr.so.3...done.
> > > > > > > > Loaded symbols for /lib/libthr.so.3
> > > > > > > > Reading symbols from /usr/local/lib/libintl.so.8...done.
> > > > > > > >
> > > > > > >          > Loaded symbols for /usr/local/lib/libintl.so.8
> > > > > > > >
> > > > > > > > Reading symbols from /usr/local/lib/libiconv.so.3...done.
> > > > > > > > Loaded symbols for /usr/local/lib/libiconv.so.3
> > > > > > > > Reading symbols from /usr/lib/libssl.so.5...done.
> > > > > > > > Loaded symbols for /usr/lib/libssl.so.5
> > > > > > > > Reading symbols from /lib/libcrypto.so.5...done.
> > > > > > > > Loaded symbols for /lib/libcrypto.so.5
> > > > > > > > Reading symbols from /usr/lib/libstdc++.so.6...done.
> > > > > > > > Loaded symbols for /usr/lib/libstdc++.so.6
> > > > > > > > Reading symbols from /lib/libm.so.5...done.
> > > > > > > > Loaded symbols for /lib/libm.so.5
> > > > > > > > Reading symbols from /lib/libgcc_s.so.1...done.
> > > > > > > > Loaded symbols for /lib/libgcc_s.so.1
> > > > > > > > Reading symbols from /lib/libc.so.7...done.
> > > > > > > > Loaded symbols for /lib/libc.so.7
> > > > > > > > Reading symbols from /libexec/ld-elf.so.1...done.
> > > > > > > > Loaded symbols for /libexec/ld-elf.so.1
> > > > > > > > #0  get_first_port_host_order (addrs=0x5a5a5a5a) at
> > > > > > > > address_conf.h:64 64      address_conf.h: No such file or
> > > > > > > > directory.
> > > > > > > >         in address_conf.h
> > > > > > > > [New Thread 0x28601200 (LWP 100058)]
> > > > > > > > [New Thread 0x28601100 (LWP 100478)]
> > > > > > > > (gdb) backtrace
> > > > > > > > #0  get_first_port_host_order (addrs=0x5a5a5a5a) at
> > > > > > > > address_conf.h:64 #1  0x0804cc93 in terminate_filed (sig=15)
> > > > > > > > at filed.c:238 #2  0x0807dbf9 in signal_handler (sig=15) at
> > > > > > > > signal.c:180 #3  0xbfbfffb4 in ?? ()
> > > > > > > > #4  0x0000000f in ?? ()
> > > > > > > > #5  0x00000000 in ?? ()
> > > > > > > > #6  0xbf9febc0 in ?? ()
> > > > > > > > #7  0x00000000 in ?? ()
> > > > > > > > #8  0x0807d940 in init_stack_dump () at signal.c:190
> > > > > > > > #9  0x280e41de in pthread_cond_init () from /lib/libthr.so.3
> > > > > > > > #10 0x08081588 in watchdog_thread (arg=0x0) at watchdog.c:307
> > > > > > > > #11 0x280dda7f in pthread_getprio () from /lib/libthr.so.3
> > > > > > > > #12 0x00000000 in ?? ()
> > > > > > > > Current language:  auto; currently c++
> > > > > > > > (gdb)
> > > > > > > >
> > > > > > > > --
> > > > > > > > Dan Langille - http://www.langille.org/
> > > > > > > > Available for hire:
> > > > > > > > http://www.freebsddiary.org/dan_langille.php
> > > > > > >
> > > > > > > ---------------------------------------------------------------
> > > > > > >---- ---- -- This SF.net email is sponsored by: Microsoft
> > > > > > > Defy all challenges. Microsoft(R) Visual Studio 2005.
> > > > > > > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
> > > > > > > _______________________________________________
> > > > > > > Bacula-devel mailing list
> > > > > > > [email protected]
> > > > > > > https://lists.sourceforge.net/lists/listinfo/bacula-devel

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel

Reply via email to