Hi,

On Wednesday 17 March 2010 00:17:59, Hugh Brown wrote:
> This is a complicated problem; apologies in advance if there's any
> missing information.
> 
> I'm running Bacula 5.0.1 on CentOS 5.4, x86_64.  I came back from a
> week's vacation today to discover that the storage daemon had become
> hung one day after my vacation started. :-(
> 
> Three jobs were running, and everything else was stacked up behind
> that, waiting on Max Storage Jobs (currently set to 3).  There were
> two bacula-sd processes listed: one old (dating from the last reboot
> of the machine) and one young (dated from the time the hung jobs had
> run).
> 
> I installed the debuginfo RPM, and ran btraceback (output attached)
> with the PID of the younger job as an argument.  After that, I tried
> "kill -3 [younger PID]", and when that didn't work ran "kill -9
> [younger PID]".  At that point, surprisingly, the three hung jobs
> finished, and the jobs that had been waiting to run began to run. From
> the reports (attached), it looked like the director had tried to kill
> the jobs since they'd run far too long, but this evidently did not
> succeed:
> 
>       Error: Watchdog sending kill after 518427 secs to thread stalled reading
> Storage daemon.
> 
> After doing some searching, I came across bug #1527
> (http://bugs.bacula.org/view.php?id=1527), which looks similar to
> my problem in one respect: the output of "status storage" in bconsole
> just hung when it got to "Used volume status".  (I'm afraid I did not
> keep a copy of the output.)  However, the tracebacks from that bug
> look different from mine, so I'm not sure that it's the same.


Yes, but your backtrace looks very strange, so I'm not sure that we can trust
it.
 
> As I mentioned, I came across this bug a week after it occurred
> (sigh), so my ability to get more info is limited.  I will be running
> backups again tonight and will be watching closely; I've added
> monitoring for big stacks of long-running jobs, which should hopefully
> catch this if it happens again.
> 
> My questions are:
> 
> -- Is this backtrace worth submitting as a bug report?
> 
> -- Does this look like the same problem reported in #1527?  If so,
> should I recompile bacula with the lockmgr option as shown in the bug
> report?

You can also apply the attached patch.


> -- The director tried to kill the long-standing job but failed.  Is
> this just another symptom of a deadlock in bacula-sd, or is there
> something else going on?

Without a good backtrace on all components, it's hard to say.

Once you have applied the patch and turned on the lock manager, you can submit
a bug report if you find a deadlock.
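
If it hangs again, something like the following could help to collect a
backtrace from every daemon before killing anything. It is only a rough sketch,
not part of Bacula: it assumes gdb and pgrep are installed, that the daemons run
under the usual process names (bacula-dir, bacula-sd, bacula-fd), and that you
run it as root. The function names are made up for the example. btraceback
remains the supported way to get a traceback.

```python
import subprocess

# Assumed process names; adjust if your daemons are named differently.
DAEMONS = ("bacula-dir", "bacula-sd", "bacula-fd")


def gdb_backtrace_cmd(pid):
    """Build a gdb command that attaches to pid, prints every thread's
    stack, and detaches again (batch mode exits when done)."""
    return ["gdb", "-batch", "-ex", "thread apply all bt", "-p", str(pid)]


def pids_of(name):
    """Return the PIDs of all processes whose name is exactly `name`."""
    result = subprocess.run(["pgrep", "-x", name],
                            capture_output=True, text=True)
    return [int(tok) for tok in result.stdout.split()]


def collect_backtraces(daemons=DAEMONS):
    """Return a dict mapping 'name.pid' to the gdb output for that process."""
    traces = {}
    for name in daemons:
        for pid in pids_of(name):
            proc = subprocess.run(gdb_backtrace_cmd(pid),
                                  capture_output=True, text=True)
            traces[f"{name}.{pid}"] = proc.stdout + proc.stderr
    return traces
```

Calling collect_backtraces() and writing each value to a file gives one
traceback per process, including both bacula-sd processes if two are present
again.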

Bye

> Thanks in advance for any advice you can give, and please let me know
> if you need any further info.
> 
> --
> Hugh Brown, Systems Manager
> The Centre for High-Throughput Biology
> [email protected]

_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel
