Hello,

As you probably know, I have been devoting a huge amount of time to debugging 
lately, and it is likely to continue for another month or so.  This is 
because we have recently had a flurry of bug reports (mostly old bugs) and in 
the process of fixing some of them, I have written new regression scripts 
that have pointed out at least four additional bugs.

One of the problems is that Bacula has grown much bigger and hence more 
complicated, and the bugs that we are getting are often race conditions that 
involve multiple simultaneous jobs.  

While debugging, I have developed a number of new techniques.  The most 
important has been to add the JobId to debug output.  I implemented several 
ways of printing the JobId, then finally threw all that code out and 
implemented a new, cleaner way of doing so.  The change is a diff of some 
2000+ lines that I will commit to SVN in the next couple of hours.

The main purpose of this email is to explain the new code a bit, and to warn 
you that there may be a few glitches that remain to be fixed.

The old code printed the JobId in debug output in one of two ways.  If a jcr 
(job control record), which contains the JobId, was available, it used that.  
If no jcr was available (typically in lower level routines), it called a 
subroutine, get_jobid_from_tid(), which searched the jcr chain for a jcr that 
was being run by the current thread.  This required a lot of code to add the 
JobId to each debug message, and it took a big performance hit from searching 
the jcrs.  Finally, it would fail if certain jcr debug code was turned on, 
either by deadlocking on the jcr chain locking or by getting into a recursive 
loop.

The new code throws all that out, and now stores the jcr pointer in thread 
specific data.  The result is that it is always possible to obtain a pointer 
to the jcr for the currently running thread, and this is quite efficient too.
I've modified the Jmsg() and Emsg() routines to automatically look up the jcr 
if none is supplied, and I have modified the Dmsg() debug code to print the 
JobId automatically.  The old Dmsg() code printed output that looked a bit 
like:

rufus-sd: reserve.c:999 Try match res=File

where reserve.c:999 is the file and line that generated the message.  The new 
code prints:

rufus-sd: reserve.c:999-54 Try match res=File

where the extra "-54" is the JobId.  It isn't particularly useful in the 
output of a single job, but when multiple threads are running, it is *really* 
nice.

If the JobId is zero, it means that the thread that printed that message has 
no JobId associated with it.
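
To make the idea concrete, here is a minimal sketch in C of keeping a jcr 
pointer in thread specific data using POSIX thread keys, and of formatting a 
debug line with the JobId appended to the line number.  It is not the code 
being committed: sketch_jcr, sketch_set_jcr(), sketch_get_jobid() and 
sketch_format_dmsg() are hypothetical names made up for illustration, and the 
real jcr has many more fields.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the job control record (illustration only). */
typedef struct {
    uint32_t JobId;
} sketch_jcr;

static pthread_key_t  jcr_key;
static pthread_once_t jcr_key_once = PTHREAD_ONCE_INIT;

/* Create the key exactly once, no matter which thread gets here first. */
static void create_jcr_key(void)
{
    pthread_key_create(&jcr_key, NULL);
}

/* Associate a jcr with the calling thread. */
void sketch_set_jcr(sketch_jcr *jcr)
{
    pthread_once(&jcr_key_once, create_jcr_key);
    pthread_setspecific(jcr_key, jcr);
}

/* Return the JobId of the calling thread, or 0 if it has no jcr. */
uint32_t sketch_get_jobid(void)
{
    pthread_once(&jcr_key_once, create_jcr_key);
    sketch_jcr *jcr = (sketch_jcr *)pthread_getspecific(jcr_key);
    return jcr ? jcr->JobId : 0;
}

/* Format "daemon: file:line-JobId message" into buf. */
int sketch_format_dmsg(char *buf, size_t len, const char *daemon,
                       const char *file, int line, const char *msg)
{
    return snprintf(buf, len, "%s: %s:%d-%u %s", daemon, file, line,
                    (unsigned)sketch_get_jobid(), msg);
}
```

The lookup is a single pthread_getspecific() call with no chain search and no 
chain lock, which is where the efficiency comes from.  A thread that never 
called sketch_set_jcr() simply reports JobId 0.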

Some problems:
There are a number of cases in Bacula where a thread can switch from one JobId 
to another one -- the most common one is in the jobq.c code that does the 
resource scheduling in the Director.  The other important place is in the 
Storage daemon message thread that runs in the Director.   I think I have 
found *most* places in the core code where a thread can handle multiple jobs 
and have fixed them appropriately, but there remain a few places where some 
work must be done:  one is the catalog callbacks from the SD to the Dir, and 
the other is in some of the utility programs such as bscan.  I imagine we 
will deal with those as they show up and become a problem.

The main thing to be aware of is that in those odd cases where a thread deals 
with more than one job, it is possible that the thread specific data points 
to the wrong jcr, or even to memory that has been released, and in those 
cases, the debug code will not work correctly.
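
One way to picture the fix for a thread that handles several jobs is sketched 
below in C: the thread rebinds its thread specific pointer each time it picks 
up a different job, and clears the binding before a jcr is released so that no 
stale pointer is left behind.  This is only a sketch under my own assumptions; 
sketch_jcr, sketch_switch_job() and sketch_release_job() are hypothetical 
names, not routines in the Bacula source.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the job control record (illustration only). */
typedef struct {
    uint32_t JobId;
} sketch_jcr;

static pthread_key_t  sketch_key;
static pthread_once_t sketch_once = PTHREAD_ONCE_INIT;

static void sketch_make_key(void)
{
    pthread_key_create(&sketch_key, NULL);
}

/* jcr currently bound to the calling thread, or NULL. */
sketch_jcr *sketch_current(void)
{
    pthread_once(&sketch_once, sketch_make_key);
    return (sketch_jcr *)pthread_getspecific(sketch_key);
}

/* Bind the calling thread to another job; returns the previous binding
 * so the caller can restore it when it goes back to the earlier job. */
sketch_jcr *sketch_switch_job(sketch_jcr *next)
{
    sketch_jcr *prev = sketch_current();
    pthread_setspecific(sketch_key, next);
    return prev;
}

/* Call this before freeing a jcr: if the calling thread still points
 * at it, clear the binding so later debug output sees JobId 0 instead
 * of released memory. It only affects the calling thread. */
void sketch_release_job(sketch_jcr *jcr)
{
    if (sketch_current() == jcr) {
        pthread_setspecific(sketch_key, NULL);
    }
}

/* JobId for the calling thread, 0 when no job is bound. */
uint32_t sketch_current_jobid(void)
{
    sketch_jcr *jcr = sketch_current();
    return jcr ? jcr->JobId : 0;
}
```

Note that sketch_release_job() can only clear the binding of the thread that 
calls it.  If some other thread still holds a pointer to the same jcr, that 
thread has to switch away on its own, which is exactly the kind of case that 
still needs care.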

The upside of all this is that the code is now much simpler, much more 
efficient, and when we are sure it works correctly, we will be able to remove 
the passing of the jcr to a lot of subroutines where it is passed only to be 
able to direct error messages to the right place.

Best regards,

Kern

  

_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel
