Hello,

As you probably know, I have been devoting a huge amount of time to debugging lately, and that is likely to continue for another month or so. We have recently had a flurry of bug reports (mostly old bugs), and in the process of fixing some of them I have written new regression scripts that have pointed out at least four additional bugs.
One of the problems is that Bacula is now much bigger and hence more complicated, and the bugs we are getting are often race conditions that involve multiple simultaneous jobs. While debugging, I have developed a number of new techniques; the most important has been to add the JobId to debug output. I implemented several ways of printing the JobId, then finally threw all that code out and implemented a new, cleaner way of doing so. This is some 2000+ lines of diff that I will commit to the SVN in the next couple of hours.

The main purpose of this email is to explain the new code a bit, and to warn you that there may be a few glitches that remain to be fixed.

The old code printed the JobId in debug output either by using a jcr (job control record), which contains the JobId, or, if no jcr was available (typically in lower level routines), by calling a subroutine get_jobid_from_tid(), which would search the jcr chain for a jcr that was being run by the current thread. This required a lot of code to add the JobId to each debug message, it took a big performance hit from searching the jcrs, and finally, it would fail if certain jcr debug code was turned on, by getting itself into a deadlock with the jcr chain locking or into a recursive loop.

The new code throws all that out and stores the jcr pointer in thread specific data. The result is that it is always possible to obtain a pointer to the jcr for the currently running thread, and this is quite efficient too. I've modified the Jmsg() and Emsg() routines to automatically look up the jcr if none is supplied, and I have modified the Dmsg() debug code to print the JobId automatically.

The old Dmsg() code printed output that looked a bit like:

    rufus-sd: reserve.c:999 Try match res=File

where reserve.c:999 is the file and line that generated the message. The new code prints:

    rufus-sd: reserve.c:999-54 Try match res=File

where the extra "-54" is the JobId.
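To make the thread specific data idea concrete, here is a minimal sketch using the POSIX pthread key functions. It is not the code being committed: the demo_jcr struct and every demo_* name are hypothetical, invented for illustration, and the real jcr holds far more than a JobId. It only shows how a per-thread pointer can be stored once and then looked up cheaply when a debug line is formatted, without walking the jcr chain.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the job control record; only the JobId matters here. */
typedef struct demo_jcr {
    uint32_t JobId;
} demo_jcr;

static pthread_key_t demo_jcr_key;
static pthread_once_t demo_jcr_once = PTHREAD_ONCE_INIT;
static int demo_key_status = -1;   /* 0 once the key has been created */

static void demo_create_key(void)
{
    /* No destructor: the thread only borrows the jcr, it does not own it. */
    demo_key_status = pthread_key_create(&demo_jcr_key, NULL);
}

/* Record which jcr the calling thread is working on; NULL clears it.
 * A thread that moves on to another job must call this again. */
int demo_set_current_jcr(demo_jcr *jcr)
{
    pthread_once(&demo_jcr_once, demo_create_key);
    if (demo_key_status != 0) {
        return demo_key_status;
    }
    return pthread_setspecific(demo_jcr_key, jcr);
}

/* Return the JobId of the calling thread's jcr, or 0 if it has none. */
uint32_t demo_current_jobid(void)
{
    demo_jcr *jcr;

    pthread_once(&demo_jcr_once, demo_create_key);
    if (demo_key_status != 0) {
        return 0;
    }
    jcr = (demo_jcr *)pthread_getspecific(demo_jcr_key);
    return jcr ? jcr->JobId : 0;
}

/* Build "daemon: file:line-JobId message" in buf; returns snprintf's result. */
int demo_format_debug(char *buf, size_t len, const char *daemon,
                      const char *file, int line, const char *msg)
{
    const char *base = strrchr(file, '/');   /* drop any directory part */

    assert(buf != NULL && len > 0);
    base = base ? base + 1 : file;
    return snprintf(buf, len, "%s: %s:%d-%u %s",
                    daemon, base, line, (unsigned)demo_current_jobid(), msg);
}
```

In this sketch, a thread that has called demo_set_current_jcr() with a jcr whose JobId is 54 and then formats a message from reserve.c line 999 gets the "-54" suffix, while a thread that never set a jcr gets "-0". The key has no destructor, so nothing clears the pointer when the jcr is freed; the thread itself has to reset it.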
The extra JobId isn't particularly useful in the output of a single job, but when multiple threads are running, it is *really* nice. If the JobId is zero, it means that the thread that printed that message has no JobId associated with it.

Some problems:

There are a number of cases in Bacula where a thread can switch from one JobId to another. The most common one is in the jobq.c code that does the resource scheduling in the Director. The other important place is the Storage daemon message thread that runs in the Director. I think I have found *most* places in the core code where a thread can handle multiple jobs and have fixed them appropriately, but there remain a few places where some work must be done: one is the catalog callbacks from the SD to the Dir, and the other is in some of the utility programs such as bscan. I imagine we will treat those as they show up and become a problem.

The main thing to be aware of is that in those odd cases where a thread deals with more than one job, it is possible that the thread specific data points to the wrong jcr, or even to memory that has been released, and in those cases the debug code will not work correctly.

The upside of all this is that the code is now much simpler and much more efficient, and once we are sure it works correctly, we will be able to remove the passing of the jcr to a lot of subroutines where it is passed only to be able to direct error messages to the right place.

Best regards,

Kern

_______________________________________________
Bacula-devel mailing list
https://lists.sourceforge.net/lists/listinfo/bacula-devel
