On Thursday 13 September 2007 15:40, Martin Simmons wrote:
> >>>>> On Thu, 13 Sep 2007 14:14:39 +0200, Kern Sibbald said:
> >
> > On Thursday 13 September 2007 13:37, Martin Simmons wrote:
> > > >>>>> On Wed, 12 Sep 2007 23:09:17 +0200, Kern Sibbald said:
> > > >
> > > > Hello,
> > > >
> > > > I believe that I may have a fix for the FD crash shown below. I'll
> > > > commit it tomorrow as I have another fix for bug #920 that I need to
> > > > fully test before committing.
> > >
> > > Is there a danger that it can terminate without deleting the pid etc
> > > now? It may be better to call pthread_exit instead of exit when
> > > already_here is true.
> >
> > I'm hoping that the true fix is the reordering of some of the termination
> > calls, although the dump that Dan produced clearly indicated the routine
> > was possibly called twice (the 05a5a ... argument to a call indicates
> > reuse of a freed Bacula buffer).
>
> Yes, I think it happens like this:
>
> 1) The main thread is idle inside bnet_thread_server().
>
> 2) The watchdog thread gets a SIGTERM (presumably delivered from outside).
>
> 3) The signal handler in the watchdog thread calls terminate_filed as
> expected, which calls bnet_stop_thread_server, which sets quit to true
> and calls pthread_kill on the main thread.
>
> 4) The watchdog thread gets preempted.
>
> 5) The main thread wakes up due to the signal and returns from
> bnet_thread_server because quit is true. It then calls terminate_filed,
> which frees the resources pointed to by the variable "me".
>
> 6) The watchdog thread runs again and passes me->FDaddrs from the freed
> "me" to get_first_port_host_order, which causes the crash.
>
> In this situation, I think write_state_file and delete_pid_file will never
> be called.
Nice analysis of the problem.
I'm reluctant to use pthread_exit() in place of the exit(1) because the
pthread_exit() does a lot of things, which could potentially put us into a
recursive loop condition. However, I have added a 2 second sleep before the
exit() so that any second call *should* yield to the first call allowing it
to cleanup or at least get to deleting the pid file.
Regards,
Kern
>
> __Martin
>
> > Concerning the use of exit(1): my view is that if we get into that
> > routine twice, it is best to get out fast. None of the exit routines are
> > critical -- yes not deleting the pid file could be annoying, but normally
> > if we were already in the file, it should be deleted.
> >
> > > Another minor issue is that lock-less use of variables like
> > > already_here is not safe on CPUs with speculative execution. It is
> > > unlikely to happen in this code, but you never know.
> >
> > Yes, it is possible that we could pass through the routine twice because
> > the variable is not set inside a mutex. However, we are trying to exit,
> > and it will be set at some point. The important point of the test is not
> > so much to prevent calling the routine twice, which should not really
> > cause any harm, but prevent it from getting into some seg fault infinite
> > loop. IMO, any additional use of mutexes is dangerous (could cause a
> > deadlock) and overkill when our main goal at that point is to do a
> > minimal cleanup and exit.
> >
> > I'm willing to see a patch that could simplify things even further, or
> > would improve the chances that the pid file is really deleted, but would
> > be *very* reluctant to accept anything that has additional locking (fear
> > of a deadlock).
> >
> > > __Martin
> > >
> > > > Regards,
> > > >
> > > > Kern
> > > >
> > > > On Wednesday 12 September 2007 21:15, Martin Simmons wrote:
> > > > > >>>>> On Wed, 12 Sep 2007 12:00:59 -0400, Dan Langille said:
> > > > > >
> > > > > > Priority: normal
> > > > > > Content-description: Mail message body
> > > > > >
> > > > > > After I encounterd repeated cores of bacula-sd, Eric Bollengier
> > > > > > suggested this patch. It passed my first set of regression tests
> > > > > > without producing a bacula-sd.core
> > > > > >
> > > > > > The traceback appears at the end of this email. Any objections
> > > > > > to a commit now?
> > > > > >
> > > > > > $ svn di
> > > > > > Index: stored/stored.c
> > > > > > =================================================================
> > > > > >== --- stored/stored.c (revision 5534)
> > > > > > +++ stored/stored.c (working copy)
> > > > > > @@ -600,10 +600,10 @@
> > > > > > if (debug_level > 10) {
> > > > > > print_memory_pool_stats();
> > > > > > }
> > > > > > - term_reservations_lock();
> > > > > > term_msg();
> > > > > > cleanup_crypto();
> > > > > > free_volume_list();
> > > > > > + term_reservations_lock();
> > > > > > close_memory_pool();
> > > > > >
> > > > > > sm_dump(false); /* dump orphaned buffers
> > > > > > */ $
> > > > >
> > > > > Looks right, but that won't fix the bacula-fd crash you included
> > > > > below
> > > > >
> > > > > :-)
> > > > >
> > > > > __Martin
> > > > >
> > > > > > And FYI:
> > > > > >
> > > > > > [EMAIL PROTECTED]:~/src/BaculaRegressionTesting] $ gdb bin/bacula-fd
> > > > > > bacula- fd.core
> > > > > > GNU gdb 6.1.1 [FreeBSD]
> > > > > > Copyright 2004 Free Software Foundation, Inc.
> > > > > > GDB is free software, covered by the GNU General Public License,
> > > > > > and you are
> > > > > > welcome to change it and/or distribute copies of it under certain
> > > > > > conditions.
> > > > > > Type "show copying" to see the conditions.
> > > > > > There is absolutely no warranty for GDB. Type "show warranty"
> > > > > > for details.
> > > > > > This GDB was configured as "i386-marcel-freebsd"...
> > > > > > Core was generated by `bacula-fd'.
> > > > > > Program terminated with signal 11, Segmentation fault.
> > > > > > Reading symbols from /lib/libz.so.4...done.
> > > > > > Loaded symbols for /lib/libz.so.4
> > > > > > Reading symbols from /lib/libthr.so.3...done.
> > > > > > Loaded symbols for /lib/libthr.so.3
> > > > > > Reading symbols from /usr/local/lib/libintl.so.8...done.
> > > > > >
> > > > > > Loaded symbols for /usr/local/lib/libintl.so.8
> > > > > >
> > > > > > Reading symbols from /usr/local/lib/libiconv.so.3...done.
> > > > > > Loaded symbols for /usr/local/lib/libiconv.so.3
> > > > > > Reading symbols from /usr/lib/libssl.so.5...done.
> > > > > > Loaded symbols for /usr/lib/libssl.so.5
> > > > > > Reading symbols from /lib/libcrypto.so.5...done.
> > > > > > Loaded symbols for /lib/libcrypto.so.5
> > > > > > Reading symbols from /usr/lib/libstdc++.so.6...done.
> > > > > > Loaded symbols for /usr/lib/libstdc++.so.6
> > > > > > Reading symbols from /lib/libm.so.5...done.
> > > > > > Loaded symbols for /lib/libm.so.5
> > > > > > Reading symbols from /lib/libgcc_s.so.1...done.
> > > > > > Loaded symbols for /lib/libgcc_s.so.1
> > > > > > Reading symbols from /lib/libc.so.7...done.
> > > > > > Loaded symbols for /lib/libc.so.7
> > > > > > Reading symbols from /libexec/ld-elf.so.1...done.
> > > > > > Loaded symbols for /libexec/ld-elf.so.1
> > > > > > #0 get_first_port_host_order (addrs=0x5a5a5a5a) at
> > > > > > address_conf.h:64 64 address_conf.h: No such file or
> > > > > > directory.
> > > > > > in address_conf.h
> > > > > > [New Thread 0x28601200 (LWP 100058)]
> > > > > > [New Thread 0x28601100 (LWP 100478)]
> > > > > > (gdb) backtrace
> > > > > > #0 get_first_port_host_order (addrs=0x5a5a5a5a) at
> > > > > > address_conf.h:64 #1 0x0804cc93 in terminate_filed (sig=15) at
> > > > > > filed.c:238 #2 0x0807dbf9 in signal_handler (sig=15) at
> > > > > > signal.c:180 #3 0xbfbfffb4 in ?? ()
> > > > > > #4 0x0000000f in ?? ()
> > > > > > #5 0x00000000 in ?? ()
> > > > > > #6 0xbf9febc0 in ?? ()
> > > > > > #7 0x00000000 in ?? ()
> > > > > > #8 0x0807d940 in init_stack_dump () at signal.c:190
> > > > > > #9 0x280e41de in pthread_cond_init () from /lib/libthr.so.3
> > > > > > #10 0x08081588 in watchdog_thread (arg=0x0) at watchdog.c:307
> > > > > > #11 0x280dda7f in pthread_getprio () from /lib/libthr.so.3
> > > > > > #12 0x00000000 in ?? ()
> > > > > > Current language: auto; currently c++
> > > > > > (gdb)
> > > > > >
> > > > > > --
> > > > > > Dan Langille - http://www.langille.org/
> > > > > > Available for hire: http://www.freebsddiary.org/dan_langille.php
> > > > >
> > > > > -------------------------------------------------------------------
> > > > >---- -- This SF.net email is sponsored by: Microsoft
> > > > > Defy all challenges. Microsoft(R) Visual Studio 2005.
> > > > > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
> > > > > _______________________________________________
> > > > > Bacula-devel mailing list
> > > > > [email protected]
> > > > > https://lists.sourceforge.net/lists/listinfo/bacula-devel
-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel