On Friday 19 March 2010 19:19:29 Eric Bollengier wrote:
> On Friday 19 March 2010 19:09:01, Kern Sibbald wrote:
> > Hello,
> >
> > I recommend that you submit this as a bug report.  Please include your
> > bacula-dir.conf and bacula-sd.conf as well as the two files you included
> > here.
> >
> > On the timeout for the alert command.  Adding it would require yet
> > another Bacula directive to specify the timeout, and that is really a
> > feature request (not something we fix as a bug in general).  Also, if the
> > alert command takes any time or stalls your system, then you have a
> > hardware or software problem that should be fixed.
> >
> > If it is the Alert command that is holding things up, then the simplest
> > thing to do is to remove the alert command.
>
> Hi Hugh,
>
> Another solution that you can implement quickly is to call a simple
> wrapper script that will do the timeout work and kill the alert command
> after x secs if it hangs.

Yes, good point Eric.
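
A minimal sketch of such a wrapper, for illustration only. It runs a command in the background, starts a watchdog that sends SIGTERM once a time limit expires, and returns the command's exit status. The function name run_with_timeout and the calling convention (seconds first, then the command) are assumptions made for this example; nothing here is part of Bacula.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command, kill it if it exceeds a time limit.
# Usage: run_with_timeout SECONDS COMMAND [ARGS...]
run_with_timeout() {
    limit=$1
    shift

    # Start the real command in the background; its output goes to our stdout.
    "$@" &
    cmd_pid=$!

    # Watchdog: sleep for the limit, then terminate the command.
    # Output is sent to /dev/null so the watchdog never holds the caller's
    # pipe open after the command has finished.
    (
        sleep "$limit" &
        sleep_pid=$!
        # If we are cancelled early, clean up the sleep and leave quietly.
        trap 'kill "$sleep_pid" 2>/dev/null; exit 0' TERM
        wait "$sleep_pid"
        kill -TERM "$cmd_pid" 2>/dev/null
    ) >/dev/null 2>&1 &
    watchdog_pid=$!

    # Wait for the command; a SIGTERM from the watchdog yields a non-zero status.
    status=0
    wait "$cmd_pid" || status=$?

    # The command is done, so cancel and reap the watchdog.
    kill -TERM "$watchdog_pid" 2>/dev/null || true
    wait "$watchdog_pid" 2>/dev/null || true

    return "$status"
}

# When run as a script with arguments, act as the wrapper itself.
if [ "$#" -ge 2 ]; then
    run_with_timeout "$@"
    exit $?
fi
```

Assuming this is saved as /usr/local/bin/alert-timeout.sh (a hypothetical path), the Alert Command could be set to "sh -c '/usr/local/bin/alert-timeout.sh 60 /usr/sbin/smartctl -H -l error -q errorsonly -d scsi %c'". One caveat: a process stuck in uninterruptible I/O ignores signals until the I/O completes, so in that case the wrapper would still block.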

By the way, Hugh: if you are 99.9% sure that the problem comes from "alert",
please don't submit a bug report.  However, if there is a race condition, we
definitely would like to see it.

Thanks,

Kern

>
> Bye
>
> > Best regards,
> >
> > Kern
> >
> > On Friday 19 March 2010 18:53:47 Hugh Brown wrote:
> > > (Sorry, once more with actual attachments.)
> > >
> > > Kern Sibbald wrote:
> > > > At this point, before sending anything, first, please ensure the
> > > > patch is applied.  If so, 90% probability you will not have any more
> > > > problems.  If you do, the lock manager will produce a nice dump with
> > > > additional information -- if it is not emailed to you, you should
> > > > find two files in your working directory that contain the traceback
> > > > and the bactrace outputs.
> > > >
> > > > The lock manager will not prevent lockups, but it will detect
> > > > deadlock situations, and then blow up the SD so as to produce a
> > > > useful dump.
> > >
> > > I ran into this problem again last night (symptoms: no response after
> > > "Used volume status:" when running "status storage" on bconsole; extra
> > > bacula-sd process, which strace shows is running futex() over and over
> > > again), and managed to get the traceback and the lock dump.
> > >
> > > Unfortunately, the deadlock detection did not seem to work; I left
> > > things hung for about 20 minutes or so before running "kill -6" on the
> > > parent SD process.  (That still left the child, so I had to "kill -9"
> > > that one.)  However, I'm hoping that the info is still useful; if so,
> > > let me know and I'll file a bug.
> > >
> > > And now for some uninformed speculation:
> > >
> > > Looking at the backtrace and the lock dump, it seems that one thread
> > > (0x4519d940) held the two locks that were being waited for by other
> > > threads.  In turn, that lock-holding thread had finished a job (jcr
> > > 0x12a92d58), ran release_device (dcr 0x12ac15b8), and was running the
> > > alert command (which is set to "sh -c '/usr/sbin/smartctl -H -l error
> > > -q errorsonly -d scsi %c'") and waiting for output.  I'm assuming that
> > > the thread was still waiting for this when I killed it.
> > >
> > > Looking at the code for release_device(), it seems that the alert
> > > command is called without the optional watchdog timer:
> > >
> > >       alert = get_pool_memory(PM_FNAME);
> > >       alert = edit_device_codes(dcr, alert, dcr->device->alert_command, "");
> > >       bpipe = open_bpipe(alert, 0, "r");
> > >
> > > (Line 529-531 in acquire.c, version 5.0.1)
> > >
> > > Obviously, if something's wrong w/the alert command or my hardware,
> > > that's bad.  But would it be a good thing to call the alert command
> > > with, say, a 60-second watchdog timer to avoid this kind of problem?
> > > If there are other issues at work that make this just a workaround,
> > > wouldn't it still be good to be alerted that there's a problem?  (Here
> > > I'm assuming that either the lock manager could do this (and
> > > kill/segfault the process, producing a backtrace), or that the timeout
> > > could be caught w/o greater harm and turned into a log message.)
> > >
> > > Natch, I'm not a programmer (let alone a Bacula dev), nor do I play
> > > one on TV, but I'm very curious about what's going on under the hood.
> > > If I've mistaken something or missed the point entirely, I'd be
> > > grateful if someone could point it out.
> > >
> > > Thanks again for your time!
> > >
> > > --
> > > Hugh Brown, Systems Manager
> > > The Centre for High-Throughput Biology
> > > [email protected]
> >
> > -------------------------------------------------------------------------
> > Download Intel® Parallel Studio Eval
> > Try the new software tools for yourself. Speed compiling, find bugs
> > proactively, and fine-tune applications for parallel performance.
> > See why Intel Parallel Studio got high marks during beta.
> > http://p.sf.net/sfu/intel-sw-dev
> > _______________________________________________
> > Bacula-devel mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/bacula-devel
>


