On Friday 19 March 2010 19:09:01, Kern Sibbald wrote:
> Hello,
> 
> I recommend that you submit this as a bug report.  Please include your
> bacula-dir.conf and bacula-sd.conf as well as the two files you included
> here.
> 
> On the timeout for the alert command.  Adding it would require yet another
> Bacula directive to specify the timeout, and that is really a feature
> request (not something we fix as a bug in general).  Also, if the alert
> command takes any time or stalls your system, then you have a hardware or
> software problem that should be fixed.
> 
> If it is the Alert command that is holding things up, then the simplest
> thing to do is to remove the alert command.

Hi Hugh,

Another solution that you can implement quickly is to call the alert command
through a simple wrapper script that enforces the timeout and kills the
command after x seconds if it hangs.
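
A minimal sketch of such a wrapper, assuming a POSIX sh. The function name
run_with_timeout and the 60-second limit in the comment are only
illustrative, not anything Bacula provides:

```shell
#!/bin/sh
# Hypothetical helper: run a command and send it SIGTERM if it is still
# running after LIMIT seconds.
# Usage: run_with_timeout LIMIT COMMAND [ARGS...]
run_with_timeout() {
    limit=$1
    shift

    # Start the real command in the background and remember its PID.
    "$@" &
    cmd_pid=$!

    # Watchdog: sleep, then terminate the command if it is still there.
    # Its output goes to /dev/null so that a leftover sleep cannot keep
    # the caller's pipe open after the command has finished.
    (
        sleep "$limit"
        kill -TERM "$cmd_pid" 2>/dev/null
    ) >/dev/null 2>&1 &
    watchdog_pid=$!

    # Wait for the command; rc is its exit status (143 if it was killed
    # by the watchdog's SIGTERM).
    rc=0
    wait "$cmd_pid" || rc=$?

    # The command is done, so the watchdog is no longer needed.
    kill "$watchdog_pid" 2>/dev/null || true
    wait "$watchdog_pid" 2>/dev/null || true

    return "$rc"
}

# In the wrapper script itself, the last line could then be something like
# (device name passed by Bacula as the first argument):
#   run_with_timeout 60 /usr/sbin/smartctl -H -l error -q errorsonly -d scsi "$1"
```

The Alert Command would then point at the wrapper script instead of calling
smartctl directly. Note that this only helps if the command can actually be
killed; a process stuck in uninterruptible I/O will ignore the signal.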

Bye

> Best regards,
> 
> Kern
> 
> On Friday 19 March 2010 18:53:47 Hugh Brown wrote:
> > (Sorry, once more with actual attachments.)
> > 
> > Kern Sibbald wrote:
> > > At this point, before sending anything, first, please ensure the patch
> > > is applied.  If so, 90% probability you will not have any more
> > > problems.  If you do, the lock manager will produce a nice dump with
> > > additional information -- if it is not emailed to you, you should find
> > > two files in your working directory that contain the traceback and the
> > > bactrace outputs.
> > > 
> > > The lock manager will not prevent lockups, but it will detect deadlock
> > > situations, and then blow up the SD so as to produce a useful dump.
> > 
> > I ran into this problem again last night (symptoms: no response after
> > "Used volume status:" when running "status storage" on bconsole; extra
> > bacula-sd process, which strace shows is running futex() over and over
> > again), and managed to get the traceback and the lock dump.
> > 
> > Unfortunately, the deadlock detection did not seem to work; I left
> > things hung for about 20 minutes or so before running "kill -6" on the
> > parent SD process.  (That still left the child, so I had to "kill -9"
> > that one.)  However, I'm hoping that the info is still useful; if so,
> > let me know and I'll file a bug.
> > 
> > And now for some uninformed speculation:
> > 
> > Looking at the backtrace and the lock dump, it seems that one thread
> > (0x4519d940) held the two locks that were being waited for by other
> > threads.  In turn, that lock-holding thread had finished a job (jcr
> > 0x12a92d58), ran release_device (dcr 0x12ac15b8), and was running the
> > alert command (which is set to "sh -c '/usr/sbin/smartctl -H -l error
> > -q errorsonly -d scsi %c'") and waiting for output.  I'm assuming that
> > the thread was still waiting for this when I killed it.
> > 
> > Looking at the code for release_device(), it seems that the alert
> > command is called without the optional watchdog timer:
> > 
> >       alert = get_pool_memory(PM_FNAME);
> >       alert = edit_device_codes(dcr, alert, dcr->device->alert_command, "");
> >       bpipe = open_bpipe(alert, 0, "r");
> > 
> > (Lines 529-531 in acquire.c, version 5.0.1)
> > 
> > Obviously, if something's wrong w/the alert command or my hardware,
> > that's bad.  But would it be a good thing to call the alert command
> > with, say, a 60-second watchdog timer to avoid this kind of problem?
> > If there are other issues at work that make this just a workaround,
> > wouldn't it still be good to be alerted that there's a problem?  (Here
> > I'm assuming that either the lock manager could do this (and
> > kill/segfault the process, producing a backtrace), or that the timeout
> > could be caught w/o greater harm and turned into a log message.)
> > 
> > Natch, I'm not a programmer (let alone a Bacula dev), nor do I play
> > one on TV, but I'm very curious about what's going on under the hood.
> > If I've mistaken something or missed the point entirely, I'd be
> > grateful if someone could point it out.
> > 
> > Thanks again for your time!
> > 
> > --
> > Hugh Brown, Systems Manager
> > The Centre for High-Throughput Biology
> > [email protected]
> 
> ---------------------------------------------------------------------------
> --- Download Intel® Parallel Studio Eval
> Try the new software tools for yourself. Speed compiling, find bugs
> proactively, and fine-tune applications for parallel performance.
> See why Intel Parallel Studio got high marks during beta.
> http://p.sf.net/sfu/intel-sw-dev
> _______________________________________________
> Bacula-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/bacula-devel
