Re: [Bacula-devel] Problem with SD hang in 5.0.1

Hugh Brown Fri, 19 Mar 2010 10:54:48 -0700

Kern Sibbald wrote:
> At this point, before sending anything, first, please ensure the patch is
> applied.  If so, 90% probability you will not have any more problems.  If you
> do, the lock manager will produce a nice dump with additional information --
> if it is not emailed to you, you should find two files in your working
> directory that contain the traceback and the bactrace outputs.
>
> The lock manager will not prevent lockups, but it will detect deadlock
> situations, and then blow up the SD so as to produce a useful dump.


I ran into this problem again last night (symptoms: no response after
"Used volume status:" when running "status storage" in bconsole; an
extra bacula-sd process, which strace shows calling futex() over and
over again), and managed to get the traceback and the lock dump.

Unfortunately, the deadlock detection did not seem to work; I left
things hung for about 20 minutes before running "kill -6" on the
parent SD process.  (That still left the child, so I had to "kill -9"
that one.)  However, I'm hoping that the info is still useful; if so,
let me know and I'll file a bug.

And now for some uninformed speculation:

Looking at the backtrace and the lock dump, it seems that one thread
(0x4519d940) held the two locks that the other threads were waiting
for.  That lock-holding thread, in turn, had finished a job (jcr
0x12a92d58), had called release_device (dcr 0x12ac15b8), and was
running the alert command (which is set to "sh -c '/usr/sbin/smartctl
-H -l error -q errorsonly -d scsi %c'") and waiting for output.  I'm
assuming that the thread was still waiting for this when I killed it.

Looking at the code for release_device(), it seems that the alert
command is called without the optional watchdog timer:

      alert = get_pool_memory(PM_FNAME);
      alert = edit_device_codes(dcr, alert, dcr->device->alert_command, "");
      bpipe = open_bpipe(alert, 0, "r");

(Lines 529-531 in acquire.c, version 5.0.1)

Obviously, if something's wrong with the alert command or my hardware,
that's bad.  But would it be a good thing to call the alert command
with, say, a 60-second watchdog timer to avoid this kind of problem?
Even if there are other issues at work that make this just a
workaround, wouldn't it still be good to be alerted that there's a
problem?  (Here I'm assuming that either the lock manager could do this
(and kill/segfault the process, producing a backtrace), or that the
timeout could be caught without greater harm and turned into a log
message.)
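
To make the idea concrete, here is a rough sketch of what "run a
command, but give up after N seconds" can look like in plain POSIX C.
This is not Bacula code: run_with_watchdog() and its return values are
made up for illustration, and it uses fork/poll directly instead of
Bacula's bpipe routines. In the real code the change would presumably
go through the second argument of open_bpipe() (the 0 in the excerpt
above), but I have not verified how that timer behaves.

```c
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper (not part of Bacula): run `cmd` through "sh -c",
 * copy up to outlen-1 bytes of its stdout into `out`, and stop waiting
 * for output after timeout_sec seconds.
 * Returns the command's exit status, -1 on error, -2 on timeout.
 * Limitation: only the wait for output is guarded; a command that
 * closes stdout and then hangs would still block in waitpid(). */
int run_with_watchdog(const char *cmd, int timeout_sec,
                      char *out, size_t outlen)
{
    int fds[2];
    if (outlen == 0 || pipe(fds) < 0) return -1;

    pid_t pid = fork();
    if (pid < 0) { close(fds[0]); close(fds[1]); return -1; }
    if (pid == 0) {
        /* Child: own process group, so a timeout can kill sh and
         * anything it started (e.g. smartctl) in one go. */
        setpgid(0, 0);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[0]);
        close(fds[1]);
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);                 /* exec failed */
    }
    setpgid(pid, pid);              /* repeated in the parent to avoid a race */
    close(fds[1]);

    size_t used = 0;
    int result = 0;                 /* 0 = saw EOF, -1 = error, -2 = timeout */
    time_t deadline = time(NULL) + timeout_sec;
    for (;;) {
        time_t now = time(NULL);
        if (now >= deadline) { result = -2; break; }

        struct pollfd pfd;
        pfd.fd = fds[0];
        pfd.events = POLLIN;
        pfd.revents = 0;
        int rc = poll(&pfd, 1, (int)(deadline - now) * 1000);
        if (rc < 0 && errno == EINTR) continue;
        if (rc < 0) { result = -1; break; }
        if (rc == 0) { result = -2; break; }   /* watchdog expired */

        char buf[256];
        ssize_t n = read(fds[0], buf, sizeof buf);
        if (n < 0 && errno == EINTR) continue;
        if (n < 0) { result = -1; break; }
        if (n == 0) break;                     /* EOF: command is done */

        size_t room = outlen - 1 - used;
        size_t take = (size_t)n < room ? (size_t)n : room;
        memcpy(out + used, buf, take);         /* extra output is dropped */
        used += take;
    }
    out[used] = '\0';
    close(fds[0]);

    if (result != 0) kill(-pid, SIGKILL);      /* kill the whole group */

    int status = 0;
    while (waitpid(pid, &status, 0) < 0 && errno == EINTR) { }
    if (result != 0) return result;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With something like this, a caller could do
run_with_watchdog(alert, 60, buf, sizeof buf) and, on a -2 return, log
that the alert command timed out instead of holding the device locks
forever.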

Natch, I'm not a programmer (let alone a Bacula dev), nor do I play
one on TV, but I'm very curious about what's going on under the hood.
If I've mistaken something or missed the point entirely, I'd be
grateful if someone could point it out.

Thanks again for your time!

--
Hugh Brown, Systems Manager
The Centre for High-Throughput Biology
[email protected]

