>>>>> On Thu, 13 Sep 2007 14:14:39 +0200, Kern Sibbald said:
> 
> On Thursday 13 September 2007 13:37, Martin Simmons wrote:
> > >>>>> On Wed, 12 Sep 2007 23:09:17 +0200, Kern Sibbald said:
> > >
> > > Hello,
> > >
> > > I believe that I may have a fix for the FD crash shown below.  I'll
> > > commit it tomorrow as I have another fix for bug #920 that I need to
> > > fully test before committing.
> >
> > Is there a danger that it can terminate without deleting the pid etc now? 
> > It may be better to call pthread_exit instead of exit when already_here is
> > true.
> 
> I'm hoping that the true fix is the reordering of some of the termination 
> calls, although the dump that Dan produced clearly indicated the routine was 
> possibly called twice (the 05a5a ... argument to a call indicates reuse of a 
> freed Bacula buffer).

Yes, I think it happens like this:

1) The main thread is idle inside bnet_thread_server().

2) The watchdog thread gets a SIGTERM (presumably delivered from outside).

3) The signal handler in the watchdog thread calls terminate_filed as
   expected, which calls bnet_stop_thread_server, which sets quit to true and
   calls pthread_kill on the main thread.

4) The watchdog thread gets preempted.

5) The main thread wakes up due to the signal and returns from
   bnet_thread_server because quit is true.  It then calls terminate_filed,
   which frees the resources pointed to by the variable "me".

6) The watchdog thread runs again and passes me->FDaddrs from the freed "me"
   to get_first_port_host_order, which causes the crash.

In this situation, I think write_state_file and delete_pid_file will never be
called.

__Martin


> 
> Concerning the use of exit(1): my view is that if we get into that routine 
> twice, it is best to get out fast.  None of the exit routines are critical -- 
> yes not deleting the pid file could be annoying, but normally if we were 
> already in the file, it should be deleted.
> 
> >
> > Another minor issue is that lock-less use of variables like already_here is
> > not safe on CPUs with speculative execution.  It is unlikely to happen in
> > this code, but you never know.
> 
> Yes, it is possible that we could pass through the routine twice because the 
> variable is not set inside a mutex.  However, we are trying to exit, and it 
> will be set at some point.  The important point of the test is not so much to 
> prevent  calling the routine twice, which should not really cause any harm, 
> but prevent it from getting into some seg fault infinite loop.   IMO, any 
> additional use of mutexes is dangerous (could cause a deadlock) and overkill 
> when our main goal at that point is to do a minimal cleanup and exit.
> 
> I'm willing to see a patch that could simplify things even further, or would 
> improve the chances that the pid file is really deleted, but would be *very* 
> reluctant to accept anything that has additional locking (fear of a 
> deadlock).
> 
> >
> > __Martin
> >
> > > Regards,
> > >
> > > Kern
> > >
> > > On Wednesday 12 September 2007 21:15, Martin Simmons wrote:
> > > > >>>>> On Wed, 12 Sep 2007 12:00:59 -0400, Dan Langille said:
> > > > >
> > > > > Priority: normal
> > > > > Content-description: Mail message body
> > > > >
> > > > > After I encounterd repeated cores of bacula-sd, Eric Bollengier
> > > > > suggested this patch.  It passed my first set of regression tests
> > > > > without producing a bacula-sd.core
> > > > >
> > > > > The traceback appears at the end of this email.  Any objections to a
> > > > > commit now?
> > > > >
> > > > > $ svn di
> > > > > Index: stored/stored.c
> > > > > ===================================================================
> > > > > --- stored/stored.c     (revision 5534)
> > > > > +++ stored/stored.c     (working copy)
> > > > > @@ -600,10 +600,10 @@
> > > > >     if (debug_level > 10) {
> > > > >        print_memory_pool_stats();
> > > > >     }
> > > > > -   term_reservations_lock();
> > > > >     term_msg();
> > > > >     cleanup_crypto();
> > > > >     free_volume_list();
> > > > > +   term_reservations_lock();
> > > > >     close_memory_pool();
> > > > >
> > > > >     sm_dump(false);                    /* dump orphaned buffers */
> > > > > $
> > > >
> > > > Looks right, but that won't fix the bacula-fd crash you included below
> > > > :-)
> > > >
> > > > __Martin
> > > >
> > > > > And FYI:
> > > > >
> > > > > [EMAIL PROTECTED]:~/src/BaculaRegressionTesting] $ gdb bin/bacula-fd 
> > > > > bacula-
> > > > > fd.core
> > > > > GNU gdb 6.1.1 [FreeBSD]
> > > > > Copyright 2004 Free Software Foundation, Inc.
> > > > > GDB is free software, covered by the GNU General Public License, and
> > > > > you are
> > > > > welcome to change it and/or distribute copies of it under certain
> > > > > conditions.
> > > > > Type "show copying" to see the conditions.
> > > > > There is absolutely no warranty for GDB.  Type "show warranty" for
> > > > > details.
> > > > > This GDB was configured as "i386-marcel-freebsd"...
> > > > > Core was generated by `bacula-fd'.
> > > > > Program terminated with signal 11, Segmentation fault.
> > > > > Reading symbols from /lib/libz.so.4...done.
> > > > > Loaded symbols for /lib/libz.so.4
> > > > > Reading symbols from /lib/libthr.so.3...done.
> > > > > Loaded symbols for /lib/libthr.so.3
> > > > > Reading symbols from /usr/local/lib/libintl.so.8...done.
> > > > >
> > > >                > Loaded symbols for /usr/local/lib/libintl.so.8
> > > > >
> > > > > Reading symbols from /usr/local/lib/libiconv.so.3...done.
> > > > > Loaded symbols for /usr/local/lib/libiconv.so.3
> > > > > Reading symbols from /usr/lib/libssl.so.5...done.
> > > > > Loaded symbols for /usr/lib/libssl.so.5
> > > > > Reading symbols from /lib/libcrypto.so.5...done.
> > > > > Loaded symbols for /lib/libcrypto.so.5
> > > > > Reading symbols from /usr/lib/libstdc++.so.6...done.
> > > > > Loaded symbols for /usr/lib/libstdc++.so.6
> > > > > Reading symbols from /lib/libm.so.5...done.
> > > > > Loaded symbols for /lib/libm.so.5
> > > > > Reading symbols from /lib/libgcc_s.so.1...done.
> > > > > Loaded symbols for /lib/libgcc_s.so.1
> > > > > Reading symbols from /lib/libc.so.7...done.
> > > > > Loaded symbols for /lib/libc.so.7
> > > > > Reading symbols from /libexec/ld-elf.so.1...done.
> > > > > Loaded symbols for /libexec/ld-elf.so.1
> > > > > #0  get_first_port_host_order (addrs=0x5a5a5a5a) at address_conf.h:64
> > > > > 64      address_conf.h: No such file or directory.
> > > > >         in address_conf.h
> > > > > [New Thread 0x28601200 (LWP 100058)]
> > > > > [New Thread 0x28601100 (LWP 100478)]
> > > > > (gdb) backtrace
> > > > > #0  get_first_port_host_order (addrs=0x5a5a5a5a) at address_conf.h:64
> > > > > #1  0x0804cc93 in terminate_filed (sig=15) at filed.c:238
> > > > > #2  0x0807dbf9 in signal_handler (sig=15) at signal.c:180
> > > > > #3  0xbfbfffb4 in ?? ()
> > > > > #4  0x0000000f in ?? ()
> > > > > #5  0x00000000 in ?? ()
> > > > > #6  0xbf9febc0 in ?? ()
> > > > > #7  0x00000000 in ?? ()
> > > > > #8  0x0807d940 in init_stack_dump () at signal.c:190
> > > > > #9  0x280e41de in pthread_cond_init () from /lib/libthr.so.3
> > > > > #10 0x08081588 in watchdog_thread (arg=0x0) at watchdog.c:307
> > > > > #11 0x280dda7f in pthread_getprio () from /lib/libthr.so.3
> > > > > #12 0x00000000 in ?? ()
> > > > > Current language:  auto; currently c++
> > > > > (gdb)
> > > > >
> > > > > --
> > > > > Dan Langille - http://www.langille.org/
> > > > > Available for hire: http://www.freebsddiary.org/dan_langille.php
> > > >
> > > > -----------------------------------------------------------------------
> > > >-- This SF.net email is sponsored by: Microsoft
> > > > Defy all challenges. Microsoft(R) Visual Studio 2005.
> > > > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
> > > > _______________________________________________
> > > > Bacula-devel mailing list
> > > > [email protected]
> > > > https://lists.sourceforge.net/lists/listinfo/bacula-devel
> 

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel

Reply via email to