Hello Volker,

There were one or two race conditions that I fixed in 1.36.3.  You might look 
at the release notes and see if they appy to you.  Beware 1.36.3 requires the 
new format FileSets (and hence a Full backup unless you explicitly disable 
it).

On Tuesday 12 July 2005 00:24, Volker Sauer wrote:
> Hello,
>
> after some weeks of proper operation, my new bacula site causes
> problems.
>
> Problem description:
> After setting up the site, job scheduling worked for about 2 weeks. Now,
> quite often, the director hangs after starting the scheduled jobs at
> night. Connecting via console is not possible anymore.
>
> The only solution so far was restarting the director and manually
> running the jobs. After a few days (maybe even the next night) the
> problem occurs again.
> Looking into the archives of this list, I found some similar problem
> descriptions but none of the suggest things lead to a solution (at least
> none reported).
>
> The amazing thing is, that I have a second site with very similar
> config but only another loader, that works fine for years now.
> Now, I found out, that it could have come from a hanging sd, too. See
> below why. Question is: why hangs the dir because the sd is hanging?
>
> I haven't been able to produce a trace-file. I'll set debuglevel=100 and
> trace=1 now and see if I can gather some information and post it to this
> list as soon as possbile.
> What I can provide (so far) is some other debugging output (see below).
>
> Does anyone have ideas or comments on this odd behaviour??
>
> Regards
> Volker
>
> Debugging-Session attached:
>
> 8<-------------------------------------------------------------------------
>------
>
> System: Debian Sarge Kernel 2.6.8-2-k7-smp
> Bacula: 1.36.2-2sarge1
> Storage: Overland 10x DLT with Quantum DLT40/80
> Database: mysql 4.0.24-10 (Size: 1.4G)
> dir, sd and mysql on the same machine. Approx. 15 Clients. DiskSpooling
> for all machines except bacula-host itself.
>
> -----------------------------------------------------
> dakar: / 7# bconsole
> Connecting to Director dakar:9101
> Director authorization problem.
> Most likely the passwords do not agree.
> Please see
> http://www.bacula.org/html-manual/faq.html#AuthorizationErrors for help.
> -----------------------------------------------------
> (passwords are definitely okay).
>
>
> This is the conmesg-file so far:
> -----------------------
> dakar: /var/lib/bacula 33# cat backup-dir.conmsg
>
> 11-Jul 21:00 backup-dir: Start Backup JobId
> 899,Job=paris-home.archived.2005-07-11_21.00.00 11-Jul 21:00 backup-dir:
> Start Backup JobId 900,Job=paris-netboot.2005-07-11_21.00.01 11-Jul 21:00
> backup-sd: 3301 Issuing autochanger "loaded drive 0" command. 11-Jul 21:00
> backup-sd: 3302 Autochanger "loaded drive 0", result is Slot 3. 11-Jul
> 21:00 backup-sd: Volume "DiffInc-03" previously written, moving to end of
> data. 11-Jul 21:00 backup-dir: Start Backup JobId 907,
> Job=bali-rootfs.2005-07-11_21.00.08 ------------------------
> The bali-fd is up and seems to okay. The director seems to wait for
> something. Nothing else in this file
>
>
> There's nothing in the syslog (SCSI error etc.), either:
> ----------------------------------------------------
> Jul 11 20:40:30 dakar -- MARK --
> Jul 11 21:00:30 dakar -- MARK --
> Jul 11 21:17:01 dakar /USR/SBIN/CRON[12590]: (root) CMD (   run-parts
> --report /etc/cron.hourly)
> Jul 11 21:40:30 dakar -- MARK --
> Jul 11 22:00:30 dakar -- MARK --
> Jul 11 22:17:01 dakar /USR/SBIN/CRON[12737]: (root) CMD (   run-parts
> --report /etc/cron.hourly)
> Jul 11 22:40:30 dakar -- MARK --
> Jul 11 23:00:30 dakar -- MARK --
> -----------------------------------------------------
>
> Here's what strace -p says:
>
> on bacula-director-process:
>       dakar: ~ 2# strace -v -p 19051
>       Process 19051 attached - interrupt to quit
>       futex(0x80c5680, FUTEX_WAIT, 2, NULL
> (running bconsole doesn't produce any additional output). It seems to
> wait for something.
>
>
> on bacula-sd-process:
>       dakar: ~ 3# strace -v -p 19099
>       Process 19099 attached - interrupt to quit
>       select(5, [4], NULL, NULL, NULL <unfinished ...>
>       Process 19099 detached
>
>
> on bacula-fd on bali-fd:
>       bali: ~ 4# strace -v -p 3301
>       Process 3301 attached - interrupt to quit
>       select(4, [3], NULL, NULL, NULL <unfinished ...>
>       Process 3301 detached
>
>
> on bacula-fd on paris-fd:
>       paris: ~# strace -v -p 14280
>       Process 14280 attached - interrupt to quit
>       select(4, [3], NULL, NULL, NULL
>
> Something is quite strange about the bacula-fd on paris:
>
> PID   USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 14280 root      16   0  263m 221m 2840 S  0.0 10.9   0:00.02 
> /usr/sbin/bacula-fd -c /etc/bacula/bacula-fd.conf
>
> The bacula-fd is 263M in size!! Hell, why so much?? I don't know if
> this has something to do with the director.
>
> What I'll do right now is restart paris-fd and see what happens:
>
> Results:
> - bacula-fd on paris is normal size
> - bconsole still doesn't connect
> - backup-dir.conmsg contains messages about non-reachable paris:
>
> 11-Jul 23:33 backup-dir: paris-home.archived.2005-07-11_21.00.00 Fatal
> error: Network error with FD during Backup: ERR=No data available
> 11-Jul 23:33 backup-dir: paris-netboot.2005-07-11_21.00.01 Fatal error:
> Network error with FD during Backup: ERR=No data available
> 11-Jul 23:34 backup-dir: paris-home.archived.2005-07-11_21.00.00 Fatal
> error: No Job status returned from FD.
> 11-Jul 23:34 backup-dir: paris-netboot.2005-07-11_21.00.01 Fatal error:
> No Job status returned from FD.
> 11-Jul 23:34 backup-dir: paris-netboot.2005-07-11_21.00.01 Error: Bacula
> 1.36.2 (28Feb05): 11-Jul-2005 23:34:14
>   JobId:                  900
>   Job:                    paris-netboot.2005-07-11_21.00.01
>   Backup Level:           Incremental, since=2005-07-08 21:00:02
>   Client:                 paris-fd
>   FileSet:                "paris-netboot" 2005-06-11 01:35:56
>   Pool:                   "DiffInc"
>   Storage:                "DLT"
>   Start time:             11-Jul-2005 21:00:02
> [.........]
>
> bacula-dir still FUTEX_WAITing:
>       dakar: /var/lib/bacula 9# strace -p 19051
>       Process 19051 attached - interrupt to quit
>       futex(0x80c5680, FUTEX_WAIT, 2, NULL <unfinished ...>
>
> Well, so far. I will restart bacula-dir now und run some jobs manually:
> .....
> After running the jobs and waiting for 10 minutes nothing happens. the
> bacula-dir still works and stat sd says:
>
> Device status:
> Device "/dev/tape" is mounted with Volume "DiffInc-03"
>     Device is BLOCKED waiting for appendable media.
>     Total Bytes Read=0 Blocks Read=0 Bytes/block=0
>     Positioned at File=19 Block=0
> Data spooling: 0 active jobs, 0 bytes; 157 total jobs, 46,301,305,898
> max bytes/job.
> Attr spooling: 0 active jobs, 0 bytes; 157 total jobs, 313,408,921 max
> bytes.
>
> I'll try unmount and mount:
> *umount
> Automatically selected Storage: DLT
> 3001 Device "/dev/tape" unmounted.
> *
> *mount
> Automatically selected Storage: DLT
> 3001 Device /dev/tape is mounted with Volume "DiffInc-03"
> *
>
> But Again:
>
> Device status:
> Device "/dev/tape" is mounted with Volume "DiffInc-03"
>     Device is BLOCKED waiting for appendable media.
>     Total Bytes Read=64,512 Blocks Read=1 Bytes/block=64,512
>     Positioned at File=0 Block=0
> Data spooling: 0 active jobs, 0 bytes; 157 total jobs, 46,301,305,898
> max bytes/job.
> Attr spooling: 0 active jobs, 0 bytes; 157 total jobs, 313,408,921 max
> bytes.
>
>
> This time the bacula-sd seems to be stuck. Maybe this causes the
> bacula-fd after hours of waiting to hang???
>
> I'm restarting sd:
>
> 12-Jul 00:01 backup-sd: hanau-web.2005-07-11_23.44.52 Fatal error: Job
> 922 canceled.
> 12-Jul 00:01 backup-sd: donar-home.2005-07-11_23.44.39 Fatal error: Job
> 921 canceled.
> 12-Jul 00:01 backup-sd: paris-home.guest.2005-07-11_23.42.33 Fatal
> error: Job 915 canceled.
> 12-Jul 00:01 backup-sd: paris-home.archived.2005-07-11_23.42.29 Fatal
> error: Job 914 canceled.
> 12-Jul 00:01 backup-sd: caracas.2005-07-11_23.42.18 Fatal error: Job 913
> canceled.
> 12-Jul 00:01 donar-fd: donar-home.2005-07-11_23.44.39 Fatal error:
> job.c:1665 Bad response to Append Data command. Wanted 3000 OK data
> , got 3903 Error append data
>
> 12-Jul 00:01 paris-fd: paris-home.guest.2005-07-11_23.42.33 Fatal error:
> job.c:1665 Bad response to Append Data command. Wanted 3000 OK data
> , got 3903 Error append data
>
> 12-Jul 00:01 paris-fd: paris-home.archived.2005-07-11_23.42.29 Fatal
> error: job.c:1665 Bad response to Append Data command. Wanted 3000 OK
> data
>
> [ ... errors of broken jobs removed ...]
>
> 12-Jul 00:01 backup-dir: Start Backup JobId 918,
> Job=paris-home.staff.3.2005-07-11_23.42.44
> 12-Jul 00:01 backup-dir: Start Backup JobId 919,
> Job=paris-home.prak.2005-07-11_23.42.49
> 12-Jul 00:01 backup-sd: 3301 Issuing autochanger "loaded drive 0"
> command.
> 12-Jul 00:01 backup-sd: 3302 Autochanger "loaded drive 0", result is
> Slot 3.
> 12-Jul 00:01 backup-sd: Volume "DiffInc-03" previously written, moving
> to end of data.
> *
> *
> 12-Jul 00:02 backup-sd: Ready to append to end of Volume "DiffInc-03" at
> file=19.
> 12-Jul 00:02 backup-sd: Spooling data ...
> 12-Jul 00:02 backup-sd: Spooling data ...
>
>
> now it seems to work.... ok, all jobs finished.
>
> Question is: is there a connection between a hanging sd (which I'm not
> sure if it was hanging while the dir was hanging!?) and the hangig dir??
>
> I'll try to get a trace file of the situation!
>
> End Debugging Output
> ----------------------------------------------------------------------

-- 
Best regards,

Kern

  (">
  /\
  V_V


-------------------------------------------------------
This SF.Net email is sponsored by the 'Do More With Dual!' webinar happening
July 14 at 8am PDT/11am EDT. We invite you to explore the latest in dual
core and dual graphics technology at this free one hour event hosted by HP, 
AMD, and NVIDIA.  To register visit http://www.hp.com/go/dualwebinar
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

Reply via email to