There appears to be something wrong with what looks like a RunAfterJob
script, "/usr/local/bacula/scripts/delete_catalog_backup".  Perhaps there is a
problem with command line editing, or some race condition, because an invalid
address (mode=0x1 in open_bpipe(), Thread 9 below) is being passed on the way
to fork().
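
One way to check whether the script itself misbehaves is to run it by hand,
outside the Director, and look at its exit status.  This is only a minimal
sketch, assuming a POSIX shell; check_script is a hypothetical helper name and
is not part of Bacula:

```shell
# Hypothetical helper (not part of Bacula): run a script by hand and
# report its exit status, so a failing RunAfterJob script can be ruled
# in or out independently of the Director.
check_script() {
    script="$1"
    if [ ! -x "$script" ]; then
        echo "not executable: $script"
        return 2
    fi
    status=0
    "$script" || status=$?
    echo "exit status: $status"
    return "$status"
}

# Example call, using the path from the backtrace:
#   check_script /usr/local/bacula/scripts/delete_catalog_backup
```

If the script runs cleanly on its own, the problem is more likely in how the
Director launches it than in the script.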

On Friday 07 April 2006 03:57, Joshua Kugler wrote:
> (gdb) run  -s -f -c ../etc/bacula-dir.conf
> Starting program: /usr/local/bacula/sbin/bacula-dir -s -f -c ../etc/bacula-dir.conf
> [Please pardon the top post]
>
> OK, I compiled 1.36.3, and ran the director under gdb.  After some normal
> execution, including some volume purging, I tried to start a bunch of jobs
> like so:
>
> for ii in BackupCatalog fmpserver distance locus community elive admin1 \
>     communications1 communications2 communications3 grades1 libro otter \
>     records1 registrar1 ruth sheri1 textbook1 textbook3 textbook4 curt \
>     fiscalpro bob5 idbigblue odonnell webmaster webdev1
> do
>     echo -e "run $ii\nmod\n1\n2\nyes\nq\n" | ./bconsole
> done
>
> It ran OK for a few jobs, and then it died with the message shown below.
> BTW, I was able to kill it like this twice.
>
> When bacula-dir crashed, it also left a few rows in the db with a client ID
> of 0.
>
> I got this from running bacula-dir inside gdb.  (output from the thread
> dump below).
>
> [Thread debugging using libthread_db enabled]
> [New Thread 16384 (LWP 19065)]
> [New Thread 32769 (LWP 19067)]
> [New Thread 16386 (LWP 19068)]
> [New Thread 32771 (LWP 19069)]
> [New Thread 49156 (LWP 19072)]
> [Thread 49156 (LWP 19072) exited]
> [New Thread 65540 (LWP 19216)]
> [New Thread 81925 (LWP 19219)]
> [New Thread 98310 (LWP 19221)]
> [Thread 65540 (LWP 19216) exited]
> [New Thread 114692 (LWP 19234)]
> [Thread 114692 (LWP 19234) exited]
> [New Thread 131076 (LWP 19295)]
> herodotus-dir: dird.c:438 Director's configuration file reread.
> [Thread 131076 (LWP 19295) exited]
> [New Thread 147460 (LWP 19300)]
> [New Thread 163847 (LWP 19302)]
> [New Thread 180232 (LWP 19305)]
> [Thread 147460 (LWP 19300) exited]
> [New Thread 196612 (LWP 19308)]
> [New Thread 213001 (LWP 19312)]
> [Thread 180232 (LWP 19305) exited]
> [New Thread 229384 (LWP 19314)]
> [New Thread 245770 (LWP 19319)]
> Detaching after fork from child process 19321.
> [New Thread 262155 (LWP 19323)]
> [Thread 213001 (LWP 19312) exited]
> [New Thread 278537 (LWP 19325)]
> [New Thread 294924 (LWP 19330)]
> [New Thread 311309 (LWP 19333)]
> [Thread 245770 (LWP 19319) exited]
> Cannot find thread 245770: invalid thread handle
> (gdb)
>
> Here is the output from "thread apply all bt"
>
> (gdb) thread apply all bt
>
> Thread 30 (Thread 458763 (LWP 19883)):
> #0  0xb7e87db6 in nanosleep () from /lib/i686/libpthread.so.0
> #1  0x00000000 in ?? ()
> #2  0x0807de3c in bmicrosleep (sec=0, usec=-1251095380) at bsys.c:59
> #3  0x08057125 in create_unique_job_name (jcr=0x80f5d20, base_name=0x80a59b5 "*Console*") at job.c:658
> #4  0x08070ae0 in new_control_jcr (base_name=0x80a59b5 "*Console*", job_type=-516) at ua_server.c:101
> #5  0x08070c97 in handle_UA_client_request (arg=0x80d7518) at ua_server.c:122
> #6  0x08098426 in workq_server (arg=0x80bdda0) at workq.c:347
> #7  0xb7e81421 in pthread_start_thread () from /lib/i686/libpthread.so.0
> #8  0xb7e81591 in pthread_start_thread_event () from /lib/i686/libpthread.so.0
> #9  0xb7d1e36a in clone () from /lib/i686/libc.so.6
>
> Thread 29 (Thread 442376 (LWP 19878)):
> #0  0xb7e87db6 in nanosleep () from /lib/i686/libpthread.so.0
> #1  0x00000000 in ?? ()
> #2  0xb7e84188 in __pthread_timedsuspend_new () from /lib/i686/libpthread.so.0
> #3  0xb7e803e9 in pthread_cond_timedwait_relative () from /lib/i686/libpthread.so.0
> #4  0x080982e7 in workq_server (arg=0x80bdda0) at workq.c:322
> #5  0xb7e81421 in pthread_start_thread () from /lib/i686/libpthread.so.0
> #6  0xb7e81591 in pthread_start_thread_event () from /lib/i686/libpthread.so.0
> #7  0xb7d1e36a in clone () from /lib/i686/libc.so.6
>
> Thread 16 (Thread 229383 (LWP 19815)):
> #0  0xb7e87db6 in nanosleep () from /lib/i686/libpthread.so.0
> #1  0x00000000 in ?? ()
> #2  0x0807de3c in bmicrosleep (sec=2, usec=-1236410932) at bsys.c:59
> #3  0x0805955a in jobq_server (arg=0x80bdc20) at jobq.c:674
> #4  0xb7e81421 in pthread_start_thread () from /lib/i686/libpthread.so.0
> #5  0xb7e81591 in pthread_start_thread_event () from /lib/i686/libpthread.so.0
> #6  0xb7d1e36a in clone () from /lib/i686/libc.so.6
>
> Thread 14 (Thread 196618 (LWP 19807)):
> #0  0xb7e87db6 in nanosleep () from /lib/i686/libpthread.so.0
> #1  0x00000000 in ?? ()
> #2  0x0807de3c in bmicrosleep (sec=2, usec=-1248997940) at bsys.c:59
> #3  0x0805955a in jobq_server (arg=0x80bdc20) at jobq.c:674
> #4  0xb7e81421 in pthread_start_thread () from /lib/i686/libpthread.so.0
> #5  0xb7e81591 in pthread_start_thread_event () from /lib/i686/libpthread.so.0
> #6  0xb7d1e36a in clone () from /lib/i686/libc.so.6
>
> Thread 11 (Thread 147462 (LWP 19798)):
> #0  0xb7e87db6 in nanosleep () from /lib/i686/libpthread.so.0
> #1  0x00000000 in ?? ()
> #2  0x0807de3c in bmicrosleep (sec=2, usec=-1234309684) at bsys.c:59
> #3  0x0805955a in jobq_server (arg=0x80bdc20) at jobq.c:674
> #4  0xb7e81421 in pthread_start_thread () from /lib/i686/libpthread.so.0
> #5  0xb7e81591 in pthread_start_thread_event () from /lib/i686/libpthread.so.0
> #6  0xb7d1e36a in clone () from /lib/i686/libc.so.6
>
> Thread 9 (Thread 114692 (LWP 19778)):
> #0  0xb7e8289b in __pthread_fork () from /lib/i686/libpthread.so.0
> #1  0xb7ce81a8 in fork () from /lib/i686/libc.so.6
> #2  0xb7e82954 in fork () from /lib/i686/libpthread.so.0
> #3  0x080818b1 in open_bpipe (prog=0x80ccdf8 "/usr/local/bacula/scripts/delete_catalog_backup", wait=0, mode=0x1 <Address 0x1 out of bounds>) at bpipe.c:90
> #4  0x08056c99 in job_thread (arg=0x80de298) at job.c:262
> #5  0x080592bb in jobq_server (arg=0x80bdc20) at jobq.c:444
> #6  0xb7e81421 in pthread_start_thread () from /lib/i686/libpthread.so.0
> #7  0xb7e81591 in pthread_start_thread_event () from /lib/i686/libpthread.so.0
> #8  0xb7d1e36a in clone () from /lib/i686/libc.so.6
>
> Thread 7 (Thread 81925 (LWP 19772)):
> #0  0xb7e87db6 in nanosleep () from /lib/i686/libpthread.so.0
> #1  0x00000000 in ?? ()
> #2  0x0807de3c in bmicrosleep (sec=2, usec=-1232212532) at bsys.c:59
> #3  0x0805955a in jobq_server (arg=0x80bdc20) at jobq.c:674
> #4  0xb7e81421 in pthread_start_thread () from /lib/i686/libpthread.so.0
> #5  0xb7e81591 in pthread_start_thread_event () from /lib/i686/libpthread.so.0
> #6  0xb7d1e36a in clone () from /lib/i686/libc.so.6
>
> Thread 4 (Thread 32771 (LWP 19615)):
> #0  0xb7e87db6 in nanosleep () from /lib/i686/libpthread.so.0
> #1  0x00000001 in ?? ()
> #2  0xb7e84188 in __pthread_timedsuspend_new () from /lib/i686/libpthread.so.0
> #3  0xb7e803e9 in pthread_cond_timedwait_relative () from /lib/i686/libpthread.so.0
> #4  0x0809777a in watchdog_thread (arg=0x0) at watchdog.c:289
> #5  0xb7e81421 in pthread_start_thread () from /lib/i686/libpthread.so.0
> #6  0xb7e81591 in pthread_start_thread_event () from /lib/i686/libpthread.so.0
> #7  0xb7d1e36a in clone () from /lib/i686/libc.so.6
>
> Thread 3 (Thread 16386 (LWP 19614)):
> #0  0xb7d174a1 in select () from /lib/i686/libc.so.6
> #1  0x0000000b in ?? ()
> #2  0x080d66bc in ?? ()
> #3  0xb7adf234 in ?? ()
> #4  0x00000000 in ?? ()
> #5  0x08081167 in bnet_thread_server (addrs=0x0, max_clients=10, client_wq=0x80bdda0, handle_client_request=0x8070c70 <handle_UA_client_request>) at bnet_server.c:154
> #6  0x08070a58 in connect_thread (arg=0x80bff38) at ua_server.c:79
> #7  0xb7e81421 in pthread_start_thread () from /lib/i686/libpthread.so.0
> #8  0xb7e81591 in pthread_start_thread_event () from /lib/i686/libpthread.so.0
> #9  0xb7d1e36a in clone () from /lib/i686/libc.so.6
>
> Thread 2 (Thread 32769 (LWP 19613)):
> #0  0xb7d1529a in poll () from /lib/i686/libc.so.6
> #1  0xb7e80f00 in __pthread_manager () from /lib/i686/libpthread.so.0
> #2  0xb7e811d5 in __pthread_manager_event () from /lib/i686/libpthread.so.0
> #3  0xb7d1e36a in clone () from /lib/i686/libc.so.6
>
> Thread 1 (Thread 16384 (LWP 19609)):
> #0  0xb7e87db6 in nanosleep () from /lib/i686/libpthread.so.0
> #1  0x00000000 in ?? ()
> #2  0x0807de3c in bmicrosleep (sec=60, usec=-1073744136) at bsys.c:59
> #3  0x0805eac0 in wait_for_next_job (one_shot_job_to_run=0x80cbac0 "") at scheduler.c:101
> #4  0x0804c057 in main (argc=135003960, argv=0x80c0010) at dird.c:244
> Segmentation fault
>
> On Thursday 06 April 2006 17:10, Joshua Kugler wrote:
> > [Disclaimer: I've searched the archives best I know how.  If you can
> > point me to docs and/or messages I missed, that'd be great!]
> >
> > We've been using Bacula for over a year, and it has run great.  Recently,
> > we got a nice disk-based 5.1TB array (Coraid AoE if you care) and are
> > working on implementing it with Bacula.  All the configuration has gone
> > great, and we're doing test runs.
> >
> > This is where we run into problems.
> >
> > If I fire off Full backups of all the clients, it will run OK for a
> > while. Then at one point, I tried a command on bconsole, and it said
> >
> > 06-Apr 15:25 bconsole:  Error: bnet.c:403 Write error sending to Director
> > daemon:herodotus.cde.uaf.edu:9101: ERR=Broken pipe
> > [EMAIL PROTECTED] /usr/local/bacula/sbin]# ./bconsole
> > Connecting to Director herodotus.cde.uaf.edu:9101
> > 06-Apr 15:25 bconsole:  Fatal error: bnet.c:773 Unable to connect to
> > Director daemon on herodotus.cde.uaf.edu:9101.
> >
> > A ps -Af shows *no* bacula-dir processes left running.  Top shows
> > bacula-sd still grinding away, as well as some of the SSH tunnels.  I can
> > still get to the network drive and do things like ls and du, so it's not
> > lost communication.  Restarting bacula and doing status from bconsole
> > shows no jobs running, but the database shows a bunch of jobs in
> > JobStatus "R".
> >
> > The bacula log (/var/bacula/working/log) shows nothing out of the ordinary.
> >
> > This is on Linux, with kernel 2.6.11-12mdksmp, Bacula 1.36.1, 1GB of
> > memory. There is no dump, stack trace, or e-mail about the crash.
> >
> > I know there are more recent versions.  I don't have time right now to
> > upgrade all my clients.  Should I try 1.36.3 before I throw in the towel?
> > Any other ideas?  Am I hitting the race condition noted here:
> > http://article.gmane.org/gmane.comp.sysutils.backup.bacula.general/16842
> >
> > j----- k-----

-- 
Best regards,

Kern

  (">
  /\
  V_V


_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users