There is something wrong with what seems to be a RunAfterJob script, "/usr/local/bacula/scripts/delete_catalog_backup". Perhaps there is a problem with command line editing, or some other race condition, because the backtrace for Thread 9 shows open_bpipe() being called with an invalid address (mode=0x1, reported as out of bounds) just before it reaches fork().

On Friday 07 April 2006 03:57, Joshua Kugler wrote:
> (gdb) run -s -f -c ../etc/bacula-dir.conf
> Starting program: /usr/local/bacula/sbin/bacula-dir -s -f -c ../etc/bacula-dir.conf
> [Please pardon the top post]
>
> OK, I compiled 1.36.3, and ran the director under gdb. After some normal
> execution, including some volume purging, I tried to start a bunch of jobs
> like so:
>
> for ii in BackupCatalog fmpserver distance locus community elive admin1 \
>     communications1 communications2 communications3 grades1 libro otter \
>     records1 registrar1 ruth sheri1 textbook1 textbook3 textbook4 curt \
>     fiscalpro bob5 idbigblue odonnell webmaster webdev1
> do echo -e "run $ii\nmod\n1\n2\nyes\nq\n"| ./bconsole
> done
>
> It went on for a few OK, and then it died with the message shown below.
> BTW, I was able to kill it like this twice.
>
> When bacula-dir crashed, it also left a few rows in the db with a client ID
> of 0.
>
> I got this from running bacula-dir inside gdb. (output from the thread
> dump below).
>
> [Thread debugging using libthread_db enabled]
> [New Thread 16384 (LWP 19065)]
> [New Thread 32769 (LWP 19067)]
> [New Thread 16386 (LWP 19068)]
> [New Thread 32771 (LWP 19069)]
> [New Thread 49156 (LWP 19072)]
> [Thread 49156 (LWP 19072) exited]
> [New Thread 65540 (LWP 19216)]
> [New Thread 81925 (LWP 19219)]
> [New Thread 98310 (LWP 19221)]
> [Thread 65540 (LWP 19216) exited]
> [New Thread 114692 (LWP 19234)]
> [Thread 114692 (LWP 19234) exited]
> [New Thread 131076 (LWP 19295)]
> herodotus-dir: dird.c:438 Director's configuration file reread.
> [Thread 131076 (LWP 19295) exited]
> [New Thread 147460 (LWP 19300)]
> [New Thread 163847 (LWP 19302)]
> [New Thread 180232 (LWP 19305)]
> [Thread 147460 (LWP 19300) exited]
> [New Thread 196612 (LWP 19308)]
> [New Thread 213001 (LWP 19312)]
> [Thread 180232 (LWP 19305) exited]
> [New Thread 229384 (LWP 19314)]
> [New Thread 245770 (LWP 19319)]
> Detaching after fork from child process 19321.
> [New Thread 262155 (LWP 19323)]
> [Thread 213001 (LWP 19312) exited]
> [New Thread 278537 (LWP 19325)]
> [New Thread 294924 (LWP 19330)]
> [New Thread 311309 (LWP 19333)]
> [Thread 245770 (LWP 19319) exited]
> Cannot find thread 245770: invalid thread handle
> (gdb)
>
> Here is the output from "thread apply all bt"
>
> (gdb) thread apply all bt
>
> Thread 30 (Thread 458763 (LWP 19883)):
> #0 0xb7e87db6 in nanosleep () from /lib/i686/libpthread.so.0
> #1 0x00000000 in ?? ()
> #2 0x0807de3c in bmicrosleep (sec=0, usec=-1251095380) at bsys.c:59
> #3 0x08057125 in create_unique_job_name (jcr=0x80f5d20, base_name=0x80a59b5 "*Console*") at job.c:658
> #4 0x08070ae0 in new_control_jcr (base_name=0x80a59b5 "*Console*", job_type=-516) at ua_server.c:101
> #5 0x08070c97 in handle_UA_client_request (arg=0x80d7518) at ua_server.c:122
> #6 0x08098426 in workq_server (arg=0x80bdda0) at workq.c:347
> #7 0xb7e81421 in pthread_start_thread () from /lib/i686/libpthread.so.0
> #8 0xb7e81591 in pthread_start_thread_event () from /lib/i686/libpthread.so.0
> #9 0xb7d1e36a in clone () from /lib/i686/libc.so.6
>
> Thread 29 (Thread 442376 (LWP 19878)):
> #0 0xb7e87db6 in nanosleep () from /lib/i686/libpthread.so.0
> #1 0x00000000 in ?? ()
> #2 0xb7e84188 in __pthread_timedsuspend_new () from /lib/i686/libpthread.so.0
> #3 0xb7e803e9 in pthread_cond_timedwait_relative () from /lib/i686/libpthread.so.0
> #4 0x080982e7 in workq_server (arg=0x80bdda0) at workq.c:322
> #5 0xb7e81421 in pthread_start_thread () from /lib/i686/libpthread.so.0
> #6 0xb7e81591 in pthread_start_thread_event () from /lib/i686/libpthread.so.0
> #7 0xb7d1e36a in clone () from /lib/i686/libc.so.6
>
> Thread 16 (Thread 229383 (LWP 19815)):
> #0 0xb7e87db6 in nanosleep () from /lib/i686/libpthread.so.0
> #1 0x00000000 in ?? ()
> #2 0x0807de3c in bmicrosleep (sec=2, usec=-1236410932) at bsys.c:59
> #3 0x0805955a in jobq_server (arg=0x80bdc20) at jobq.c:674
> #4 0xb7e81421 in pthread_start_thread () from /lib/i686/libpthread.so.0
> #5 0xb7e81591 in pthread_start_thread_event () from /lib/i686/libpthread.so.0
> #6 0xb7d1e36a in clone () from /lib/i686/libc.so.6
>
> Thread 14 (Thread 196618 (LWP 19807)):
> #0 0xb7e87db6 in nanosleep () from /lib/i686/libpthread.so.0
> #1 0x00000000 in ?? ()
> #2 0x0807de3c in bmicrosleep (sec=2, usec=-1248997940) at bsys.c:59
> #3 0x0805955a in jobq_server (arg=0x80bdc20) at jobq.c:674
> #4 0xb7e81421 in pthread_start_thread () from /lib/i686/libpthread.so.0
> #5 0xb7e81591 in pthread_start_thread_event () from /lib/i686/libpthread.so.0
> #6 0xb7d1e36a in clone () from /lib/i686/libc.so.6
> ---Type <return> to continue, or q <return> to quit---
>
> Thread 11 (Thread 147462 (LWP 19798)):
> #0 0xb7e87db6 in nanosleep () from /lib/i686/libpthread.so.0
> #1 0x00000000 in ?? ()
> #2 0x0807de3c in bmicrosleep (sec=2, usec=-1234309684) at bsys.c:59
> #3 0x0805955a in jobq_server (arg=0x80bdc20) at jobq.c:674
> #4 0xb7e81421 in pthread_start_thread () from /lib/i686/libpthread.so.0
> #5 0xb7e81591 in pthread_start_thread_event () from /lib/i686/libpthread.so.0
> #6 0xb7d1e36a in clone () from /lib/i686/libc.so.6
>
> Thread 9 (Thread 114692 (LWP 19778)):
> #0 0xb7e8289b in __pthread_fork () from /lib/i686/libpthread.so.0
> #1 0xb7ce81a8 in fork () from /lib/i686/libc.so.6
> #2 0xb7e82954 in fork () from /lib/i686/libpthread.so.0
> #3 0x080818b1 in open_bpipe (prog=0x80ccdf8 "/usr/local/bacula/scripts/delete_catalog_backup", wait=0, mode=0x1 <Address 0x1 out of bounds>) at bpipe.c:90
> #4 0x08056c99 in job_thread (arg=0x80de298) at job.c:262
> #5 0x080592bb in jobq_server (arg=0x80bdc20) at jobq.c:444
> #6 0xb7e81421 in pthread_start_thread () from /lib/i686/libpthread.so.0
> #7 0xb7e81591 in pthread_start_thread_event () from /lib/i686/libpthread.so.0
> #8 0xb7d1e36a in clone () from /lib/i686/libc.so.6
>
> Thread 7 (Thread 81925 (LWP 19772)):
> #0 0xb7e87db6 in nanosleep () from /lib/i686/libpthread.so.0
> #1 0x00000000 in ?? ()
> #2 0x0807de3c in bmicrosleep (sec=2, usec=-1232212532) at bsys.c:59
> #3 0x0805955a in jobq_server (arg=0x80bdc20) at jobq.c:674
> #4 0xb7e81421 in pthread_start_thread () from /lib/i686/libpthread.so.0
> #5 0xb7e81591 in pthread_start_thread_event () from /lib/i686/libpthread.so.0
> #6 0xb7d1e36a in clone () from /lib/i686/libc.so.6
>
> Thread 4 (Thread 32771 (LWP 19615)):
> #0 0xb7e87db6 in nanosleep () from /lib/i686/libpthread.so.0
> #1 0x00000001 in ?? ()
> #2 0xb7e84188 in __pthread_timedsuspend_new () from /lib/i686/libpthread.so.0
> #3 0xb7e803e9 in pthread_cond_timedwait_relative () from /lib/i686/libpthread.so.0
> #4 0x0809777a in watchdog_thread (arg=0x0) at watchdog.c:289
> #5 0xb7e81421 in pthread_start_thread () from /lib/i686/libpthread.so.0
> #6 0xb7e81591 in pthread_start_thread_event () from /lib/i686/libpthread.so.0
> #7 0xb7d1e36a in clone () from /lib/i686/libc.so.6
> ---Type <return> to continue, or q <return> to quit---
>
> Thread 3 (Thread 16386 (LWP 19614)):
> #0 0xb7d174a1 in select () from /lib/i686/libc.so.6
> #1 0x0000000b in ?? ()
> #2 0x080d66bc in ?? ()
> #3 0xb7adf234 in ?? ()
> #4 0x00000000 in ?? ()
> #5 0x08081167 in bnet_thread_server (addrs=0x0, max_clients=10, client_wq=0x80bdda0, handle_client_request=0x8070c70 <handle_UA_client_request>) at bnet_server.c:154
> #6 0x08070a58 in connect_thread (arg=0x80bff38) at ua_server.c:79
> #7 0xb7e81421 in pthread_start_thread () from /lib/i686/libpthread.so.0
> #8 0xb7e81591 in pthread_start_thread_event () from /lib/i686/libpthread.so.0
> #9 0xb7d1e36a in clone () from /lib/i686/libc.so.6
>
> Thread 2 (Thread 32769 (LWP 19613)):
> #0 0xb7d1529a in poll () from /lib/i686/libc.so.6
> #1 0xb7e80f00 in __pthread_manager () from /lib/i686/libpthread.so.0
> #2 0xb7e811d5 in __pthread_manager_event () from /lib/i686/libpthread.so.0
> #3 0xb7d1e36a in clone () from /lib/i686/libc.so.6
>
> Thread 1 (Thread 16384 (LWP 19609)):
> #0 0xb7e87db6 in nanosleep () from /lib/i686/libpthread.so.0
> #1 0x00000000 in ?? ()
> #2 0x0807de3c in bmicrosleep (sec=60, usec=-1073744136) at bsys.c:59
> #3 0x0805eac0 in wait_for_next_job (one_shot_job_to_run=0x80cbac0 "") at scheduler.c:101
> #4 0x0804c057 in main (argc=135003960, argv=0x80c0010) at dird.c:244
> Segmentation fault
>
> On Thursday 06 April 2006 17:10, Joshua Kugler wrote:
> > [Disclaimer: I've searched the archives best I know how. If you can
> > point me to docs and/or messages I missed, that'd be great!]
> >
> > We've been using Bacula for over a year, and it has run great. Recently,
> > we got a nice disk-based 5.1TB array (Coraid AoE if you care) and are
> > working on implementing it with Bacula. All the configuration has gone
> > great, and we're doing test runs.
> >
> > This is where we run into problems.
> >
> > If I fire off Full backups of all the clients, it will run OK for a
> > while.
> > Then at one point, I tried a command on bconsole, and it said
> >
> > 06-Apr 15:25 bconsole: Error: bnet.c:403 Write error sending to Director daemon:herodotus.cde.uaf.edu:9101: ERR=Broken pipe
> > [EMAIL PROTECTED] /usr/local/bacula/sbin]# ./bconsole
> > Connecting to Director herodotus.cde.uaf.edu:9101
> > 06-Apr 15:25 bconsole: Fatal error: bnet.c:773 Unable to connect to Director daemon on herodotus.cde.uaf.edu:9101.
> >
> > A ps -Af shows *no* bacula-dir processes left running. Top shows
> > bacula-sd still grinding away, as well as some of the SSH tunnels. I can
> > still get to the network drive and do things like ls and du, so it's not
> > lost communication. Restarting bacula and doing status from bconsole
> > shows no jobs running, but the database shows a bunch of jobs in
> > JobStatus "R".
> >
> > The Bacula log (/var/bacula/working/log) shows nothing out of the
> > ordinary.
> >
> > This is on Linux, with kernel 2.6.11-12mdksmp, Bacula 1.36.1, 1GB of
> > memory. There is no dump, stack trace, or e-mail about the crash.
> >
> > I know there are more recent versions. I don't have time right now to
> > upgrade all my clients. Should I try 1.36.3 before I throw in the towel?
> > Any other ideas? Am I hitting the race condition noted here:
> > http://article.gmane.org/gmane.comp.sysutils.backup.bacula.general/16842
> >
> > j----- k-----

--
Best regards,

Kern

  (">
  /\
  V_V

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users