For a long time I have been irritated by the Bacula-DIR not being killable at once on my system (FreeBSD 5.5 and 6.3, Bacula 2.2.7 - 2.2.9-b2).
Now I tried to figure out exactly what is going on (or going wrong), and in doing so I found three things, none of which seems to have an easy solution. :(

[All following code snippets are taken from 2.2.9-b3, as this is just what I have here.]

1.) When signalled to shut down in an orderly way via SIGTERM, the DIR invokes the function terminate_dird() in dird/dird.c. This calls various cleanup functions; one of them is term_scheduler(). This function is where the shutdown stops and the DIR starts to loop forever.

term_scheduler() in dird/scheduler.c looks like this:

    void term_scheduler()
    {
       if (jobs_to_run) {
          job_item *je;
          /* Release all queued job entries to be run */
          foreach_dlist(je, jobs_to_run) {
             free(je);
          }
          delete jobs_to_run;
       }
    }

This basically does not work. What I see is that the "foreach" loop is executed only twice, and the second time the address given to free() reads "0xAAAAAAAA", which is obviously not a correct address. And then it hangs. If I comment out the free(), it does not hang.

So what we are doing here is practically something like this:

    while(je = next(je)) free(je);

We free a memory chunk and *then* use this very chunk's address as a reference to find the next chunk. This would not be a problem if the link to the next chunk were not stored in the chunk itself. But it seems that it is. It would probably also go unnoticed with a genuine Unix free(), since that only releases the chunk and typically leaves the data in it readable (although reading freed memory is still undefined behaviour). But in fact we have rewired free() to point to sm_free() in lib/smartall.c, and *this* free() overwrites the chunk with 0xAA via memset(), which explains a bit.

Now, for the solution: there are a couple of ways to rearrange that algorithm so that this effect is avoided.
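One such rearrangement, as a minimal sketch rather than a patch against Bacula: read the successor before releasing the current entry. Node, poisoning_free(), build_list() and free_all() are hypothetical stand-ins (for job_item, sm_free() and the dlist walk); Bacula's actual dlist API is deliberately not used here.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for a queued job entry. The link to the next
// entry is stored inside the entry itself (an intrusive list).
struct Node {
    Node *next;
    int   id;
};

// Hypothetical stand-in for a debugging free() that overwrites the
// chunk with 0xAA before releasing it.
void poisoning_free(Node *n) {
    assert(n != nullptr);
    std::memset(n, 0xAA, sizeof(*n));
    std::free(n);
}

// Build a simple singly linked list of `count` entries (test helper).
Node *build_list(int count) {
    Node *head = nullptr;
    for (int i = count; i >= 1; --i) {
        Node *n = static_cast<Node *>(std::malloc(sizeof(Node)));
        if (n == nullptr) std::abort();
        n->next = head;
        n->id = i;
        head = n;
    }
    return head;
}

// Release every entry and return how many were released. The successor
// is fetched *before* the entry is freed, so the loop never reads
// memory that has already been released (and poisoned).
int free_all(Node *head) {
    int released = 0;
    for (Node *cur = head; cur != nullptr; ) {
        Node *next = cur->next;   // save the link first
        poisoning_free(cur);      // cur's contents are now garbage
        cur = next;               // continue from the saved link
        ++released;
    }
    return released;
}
```

Calling free_all(build_list(3)) walks and releases all three entries without ever dereferencing a freed chunk. Whether the same idea can be expressed with the existing dlist macros is not shown here.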
Most of them seem to work, but all the ones I have tried show another misbehaviour: there is always one memory chunk remaining, which then gets reported as "orphaned" at the end of the shutdown. It seems this chunk does not even show up when walking the scheduler dlist. I would suppose that it was orphaned earlier in the program run by some other effect.

2.) The second interesting question is: why does the DIR begin to loop forever from that point on? Actually sm_free() should detect that something is wrong with the 0xAAAAAAAA address and should raise an ABORT condition, which then gets propagated to a segmentation violation. In fact it possibly tries to do that, but this cannot work: because of the SIGTERM that initiated the shutdown, we are in a signal handler! And our respective sa_mask had been set (from init_signals() in lib/signal.c) by sigfillset(), which means: block all signals!

Now we have a funny condition: our process has segfaulted and is likely no longer runnable, but it postpones the acceptance of the SEGV signal. I do not think it is well defined how a kernel should handle such a situation. Mine continues to process the traps as spare CPU allows. Others may simply get rid of the process.

Again, a solution is not all too simple. And there is another problem: the BSD manpage says this about signal handlers:

> [certain number of Unix library functions deleted]
>
> All functions not in the above lists are considered to be unsafe
> with respect to signals. That is to say, the behaviour of such
> functions when called from a signal handler is undefined. In
> general though, signal handlers should do little more than set a
> flag; most other actions are not safe.

Now, in the DIR, as far as I understand it, the whole elaborate shutdown process, which calls lots of functions not mentioned in that list, is all done within the signal handler.
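For illustration only, the "do little more than set a flag" pattern from the manpage could look roughly like the sketch below. This is not how the DIR is structured today; g_shutdown_requested, on_terminate(), install_term_handler() and poll_and_shutdown() are hypothetical names, and the counter merely stands in for the real cleanup work, which would then run in normal context instead of inside the handler.

```cpp
#include <cassert>
#include <csignal>
#include <cstring>
#include <signal.h>

// Written by the handler, polled by the main loop. volatile
// sig_atomic_t is the type that may safely be set from a handler.
volatile sig_atomic_t g_shutdown_requested = 0;

// The handler records that the signal arrived and does nothing else.
void on_terminate(int) {
    g_shutdown_requested = 1;
}

// Install the SIGTERM handler with an empty sa_mask, so that no
// additional signals are blocked while the handler runs.
bool install_term_handler() {
    struct sigaction sa;
    std::memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_terminate;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGTERM, &sa, nullptr) == 0;
}

// Hypothetical main-loop step: if a shutdown was requested, run the
// cleanup here, outside the signal handler, and report true.
bool poll_and_shutdown(int *cleanup_runs) {
    assert(cleanup_runs != nullptr);
    if (!g_shutdown_requested) {
        return false;
    }
    ++*cleanup_runs;   // placeholder for the actual cleanup functions
    return true;
}
```

With this arrangement a fault during cleanup would be delivered as an ordinary SIGSEGV, since the cleanup no longer runs with signals blocked. How such a flag would be polled from the DIR's existing threads is left open here.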
Therefore, I would not consider it useful to create a suitable sigaction configuration for all demands now, because chances are that the thing would just not behave as expected. For now, I have changed the sigfillset() to sigemptyset(). This apparently does not provide handling for the SIGSEGV, but it does terminate the process at that point.

3.) While investigating these things, I happened to wonder why the Director's PID file did not get deleted on termination. There is a simple explanation: after switching to its operating privileges (which happens after the creation of the PID file), the Director no longer has the right to delete it.

But again, the solution is not simple. If the file were created after switching the credentials, there would usually be a lack of permissions to create it in the standard /var/run directory, and an exclusive subdirectory would be needed. The latter seems to be the de-facto standard for credential-switching daemons, but it adds another step to the installation process.

rgds,
PMc

_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel