On Friday 19 March 2010 19:19:29 Eric Bollengier wrote:
> On Friday 19 March 2010 19:09:01, Kern Sibbald wrote:
> > Hello,
> >
> > I recommend that you submit this as a bug report.  Please include your
> > bacula-dir.conf and bacula-sd.conf as well as the two files you included
> > here.
> >
> > On the timeout for the alert command: adding it would require yet
> > another Bacula directive to specify the timeout, and that is really a
> > feature request (not something we fix as a bug in general).  Also, if the
> > alert command takes any time or stalls your system, then you have a
> > hardware or software problem that should be fixed.
> >
> > If it is the Alert command that is holding things up, then the simplest
> > thing to do is to remove the alert command.
>
> Hi Hugh,
>
> Another solution that you can implement quickly is to call a simple
> wrapper script that will do the timeout work and kill the alert command
> after x secs if it hangs.
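
A rough sketch of the wrapper idea Eric describes above, written in Python.
Everything in it is an assumption for illustration and not part of Bacula:
the function name run_with_timeout, the command-line layout, and the exit
code 124 (borrowed from the coreutils "timeout" convention) are all
hypothetical. It runs the given command in its own process group and kills
the whole group if the command does not finish in time, so that a stuck
helper should not keep the caller waiting for output.

```python
#!/usr/bin/env python3
"""Hypothetical wrapper: run a command and kill it if it runs too long."""
import os
import signal
import subprocess
import sys

TIMEOUT_EXIT_CODE = 124  # assumption: same value coreutils `timeout` uses


def run_with_timeout(argv, timeout_s):
    """Run argv; return its exit code, or TIMEOUT_EXIT_CODE if it was killed."""
    # start_new_session=True makes the child a process-group leader, so any
    # helpers it spawns can be killed together with it.
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group vanished between the timeout and the kill
        proc.wait()  # reap the child so no zombie is left behind
        print(f"command timed out after {timeout_s}s", file=sys.stderr)
        return TIMEOUT_EXIT_CODE


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: alert-timeout.py SECONDS COMMAND [ARGS...]
    sys.exit(run_with_timeout(sys.argv[2:], float(sys.argv[1])))
```

Assuming the script were saved as /usr/local/bin/alert-timeout.py (a made-up
path), the alert command quoted further down in this thread could be wrapped
along these lines:

    Alert Command = "/usr/local/bin/alert-timeout.py 60 /usr/sbin/smartctl -H -l error -q errorsonly -d scsi %c"

This only works around a hang; as Kern says, a stalling alert command still
points to a hardware or software problem that should be fixed.
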

Yes, good point Eric.

By the way, Hugh: If you are 99.9% sure that the problem comes from "alert"
please don't submit a bug report.  If there is a race condition, we
definitely would like to see it.

Thanks,

Kern

>
> Bye
>
> > Best regards,
> >
> > Kern
> >
> > On Friday 19 March 2010 18:53:47 Hugh Brown wrote:
> > > (Sorry, once more with actual attachments.)
> > >
> > > Kern Sibbald wrote:
> > > > At this point, before sending anything, first, please ensure the
> > > > patch is applied.  If so, 90% probability you will not have any more
> > > > problems.  If you do, the lock manager will produce a nice dump with
> > > > additional information -- if it is not emailed to you, you should
> > > > find two files in your working directory that contain the traceback
> > > > and the bactrace outputs.
> > > >
> > > > The lock manager will not prevent lockups, but it will detect
> > > > deadlock situations, and then blow up the SD so as to produce a
> > > > useful dump.
> > >
> > > I ran into this problem again last night (symptoms: no response after
> > > "Used volume status:" when running "status storage" on bconsole; extra
> > > bacula-sd process, which strace shows is running futex() over and over
> > > again), and managed to get the traceback and the lock dump.
> > >
> > > Unfortunately, the deadlock detection did not seem to work; I left
> > > things hung for about 20 minutes or so before running "kill -6" on the
> > > parent SD process.  (That still left the child, so I had to "kill -9"
> > > that one.)  However, I'm hoping that the info is still useful; if so,
> > > let me know and I'll file a bug.
> > >
> > > And now for some uninformed speculation:
> > >
> > > Looking at the backtrace and the lock dump, it seems that one thread
> > > (0x4519d940) held the two locks that were being waited for by other
> > > threads.  In turn, that lock-holding thread had finished a job (jcr
> > > 0x12a92d58), ran release_device (dcr 0x12ac15b8), and was running the
> > > alert command (which is set to "sh -c '/usr/sbin/smartctl -H -l error
> > > -q errorsonly -d scsi %c'") and waiting for output.  I'm assuming that
> > > the thread was still waiting for this when I killed it.
> > >
> > > Looking at the code for release_device(), it seems that the alert
> > > command is called without the optional watchdog timer:
> > >
> > >     alert = get_pool_memory(PM_FNAME);
> > >     alert = edit_device_codes(dcr, alert, dcr->device->alert_command, "");
> > >     bpipe = open_bpipe(alert, 0, "r");
> > >
> > > (Line 529-531 in acquire.c, version 5.0.1)
> > >
> > > Obviously, if something's wrong w/the alert command or my hardware,
> > > that's bad.  But would it be a good thing to call the alert command
> > > with, say, a 60-second watchdog timer to avoid this kind of problem?
> > > If there are other issues at work that make this just a workaround,
> > > wouldn't it still be good to be alerted that there's a problem?  (Here
> > > I'm assuming that either the lock manager could do this (and
> > > kill/segfault the process, producing a backtrace), or that the timeout
> > > could be caught w/o greater harm and turned into a log message.)
> > >
> > > Natch, I'm not a programmer (let alone a Bacula dev), nor do I play
> > > one on TV, but I'm very curious about what's going on under the hood.
> > > If I've mistaken something or missed the point entirely, I'd be
> > > grateful if someone could point it out.
> > >
> > > Thanks again for your time!
> > >
> > > --
> > > Hugh Brown, Systems Manager
> > > The Centre for High-Throughput Biology
> > > [email protected]

_______________________________________________
Bacula-devel mailing list
https://lists.sourceforge.net/lists/listinfo/bacula-devel
