Re: [Bacula-devel] Query changes in the catalog browser and indexes

Marc Cousin Sat, 25 Aug 2007 12:04:53 -0700

On Saturday 25 August 2007 17:48:22 Dirk Bartley wrote:
> On Sat, 2007-08-25 at 09:43 +0200, Marc Cousin wrote:
> > On Saturday 25 August 2007 09:03:49 Kern Sibbald wrote:
> > > Hello Marc,
> > >
> > > Now that we have a number of "browsers" and users are doing a lot more
> > > queries, it might be a good idea to consider re-organizing a few of the
> > > Bacula tables to improve the efficiency of the queries (and also
> > > perhaps backups and restores).
> > >
> > > Please see more below...
> > >
> > > On Friday 24 August 2007 19:47, Marc Cousin wrote:
> > > ...
> > >
> > > > > Thank you very much for your willingness to take a look at the
> > > > > indexes I am proposing.
> > > >
> > > > I still have the same issues I discussed with you a few days ago :
> > > > even with your new index, everything will look good as long as the
> > > > SELECT DISTINCT returns only a very small amount of data (a few
> > > > hundred to a few thousand records). It will be unbearable from a GUI
> > > > point of view if you have to retrieve hundreds of thousands of
> > > > directories at once (or even millions of directories, as the joblist
> > > > takes the last 20 jobs, even if they are full jobs).
> > > > So here we are talking hundreds of thousands of index scans/seeks for
> > > > the database (because the dir entries are completely mixed with file
> > > > entries in the File table, you probably won't have more than one/a
> > > > few dirs per database page).
> > > > My point is, if you really want to make it blazing fast (or at least
> > > > reasonably fast), there are, I think, very few methods :
> > > > - The dir entries should not be drowned in the bulk of file entries
> > > > (ie you need a new table to store them and only them, or sort
> > > > physically the table on disk so that directories are all stored
> > > > together on a few contiguous pages...)
> > >
> > > I haven't been following this as carefully as I would normally, because
> > > of my "mini-vacation", but I am interested in the basis of the problem
> > > as far as the Bacula table structure.  I ask that because the Filenames
> > > and the Paths are already separated into two tables, and that seems to
> > > be what you are proposing.  Can you be a bit more explicit?  Is it that
> > > the File entries should be split into two separate tables -- one
> > > containing a link to Paths only and one containing a link to
> > > Path/Filenames?  If so, is that something that you would propose for
> > > Bacula in general?
> > >
> > >From my point of view, the main problem is that there is no efficient
> > > way to
> >
> > get directly the list of subdirectories of a directory.
> > Let's say for instance you want to know for a server which subdirectories
> > are in /home.
> > You have to :
> > - Get all path from Path begining with '/home' (with a like query). This
> > will return ALL subdirectories of /home (/home/marc, /home/toto, but
> > also /home/marc/.kde/share/apps ...). Even if you do a regexp match, you
> > will still have to scan a big chunk of the table, as the index will match
> > everythink begining with /home
> > - Then you have to find each version of these PathIds from File table, to
> > know if they should be visible or not (by jobid). So for the example of a
> > /home, let's say you are working in an environment with 100 workstations,
> > you will get every possible entry of /home that exists in each of these
> > workstation from Path table, then find everyone in File table to know if
> > they should be visible. When you have retrieved all these rows, you then
> > have 2 choices : - Either you choose to discard most of them (you just
> > want to display /home, you don't care for /home/marc/.kde right now), and
> > you'll have to redo nearly the same query in the next 10 seconds, or you
> > choose to store everything you retrieve in memory. This can be
> > problematic too of course, you may have millions of records to retrieve,
> > and very few of them will be used.
> >
> > I didn't even mention 'time navigation' (how to display the same backup
> > from a different backup point of view).
> >
> > I'll describe the solution we chose with Eric below, but here's what we
> > felt we need :
> > To be able to retrieve only what's needed, and do that very fast, we need
> > : - To be able to retrieve very fast the subdirectories of a directory
> > (it means chaining them, for instance one can add a parent_pathid in the
> > path table)
> >     - To be able to retrieve the 'visibility' of paths. There arised the
> > second problem we had : a directory should be displayed if one of it's
> > subdirectories is to be displayed (for instance, I decided to
> > backup /home/marc and /home/eric, but not /home. /home should be
> > displayed nevertheless.)
> >
> > We felt this should not be done during backup, as this is work that is
> > not required for anything but the GUI. But then, updating all the records
> > of File and Path table afterwards would lead to a great deal of
> > fragmentation of these tables. So we settled for what I describe below
> >
> > > > - Even better, if you could only retrieve as few entries as possible
> > > > each time, it would be even better. It means being able to retrieve
> > > > only the subdirectories of the current directory. This isn't feasible
> > > > with the current schema either, as you need a way to link directories
> > > > with subdirectories (bi-directonnaly if possible).
> > >
> > > I am also interested in understanding the tables that you and Eric use
> > > in brestore to speed up retrieving multiple versions of a file. 
> > > Assuming it is a good idea to split the current "File" table into
> > > "Dirs" and "Files", do you think it would be a good idea to have the
> > > Bacula core build and maintain any other tables for fast lookups?
> > >
> > > Another possible table would be a list of the "top" level Directories
> > > for each Job.  That might significantly speed up retrieving the base
> > > directory structure.
> >
> > Here's what we've done :
> > We created two tables :
> > -pathvisibility (pathid, jobid)
> > -pathhierarchy (ppathid, pathid)
> > -knowjobids (i don't remember the realname, eric can correct me on this
> > :) )
> >
> > - pathhierarchy gives us the link between directories and subdirectories.
> > - pathvisibility tells us when we should display a pathid (for which
> > jobs) Both these tables are updated :
> >     - By a cron job run everymorning (it does it for every jobid not in
> > knownjobids). It can be quite fast if we do it on a big scale (several
> > jobids at once)
> >     - When we miss the data (brestore does it if we ask for a jobid that
> > hasn't been calculated, ie it's not in knownjobids). Of course we try to
> > avoid that, but at least, if it's done, it's only done once. We do all
> > this only using 'big queries', not inserting records one by one.
>
> Just looking at this I can tell that it would be very easy for me to
> modify the version browser to be considerably faster with these two
> tables.  The suggestion to make it easier of adding a top level
> directory I agree with.  Not sure if it would NEED to be it's own table.
> Maybe if pathvisibility had (pathid, jobid, isroot).  Then index the
> table on isroot.  This would still provide possibilities for queries
> that would be quite fast.
>
> These ideas I like.


I don't really see the point of tagging a directory as the root directory. The 
approach we took was : 
- '' (null string) is the root dir for every server. It's only subdir is / on 
unix. It contains c: d: ... on windows. It's just that this way, your file 
browser opens with empty path, which displays a nice '/' for your unix 
server, and a nice c: for windows, using exactly the same code for both.
- is a path visible? We display a path with no date/properties when it's there  
only because it has subdirectories which are in a selected backup (so it's 
calculated as visible in pathvisibility, even if it isn't really in the 
backup)

Can you explain how you want to do it ?

>
>
>
> So these scripts to populate these tables for brestore, are they perl
> scripts run in cron jobs??

For now, it's part of brestore, and you can run brestore with no gui (a 
command line argument) to populate (and create) these tables, and we run it 
with cron.
Of course, the ideal way of doing it would be to ask the director to do the 
job by himself (a new type of job ?), so it could be scheduled for instance 
just after a group of backups, or just after the last one. It would feel more 
integrated. With the added benefit of having a service you can call from the 
GUI to create the data for an uncalculated job.
The best way, from my point of view, would be of having everything done at the 
director layer : 
having something along the lines of 
'set_target_date()' to choose the jobids you want to work on
'cd', 'ls', etc ... (all usual fs calls) then displaying what should be seen 
for the selected jobids.
I've not gone deeply enough into the director's code to know if it would be 
easy or the good way to do it...

Anyhow, most of brestore's code is SQL, with perl around it. It should be 
portable to C, even if more burdensome. Most of the dirty work is done in 
SQL, perl is mostly for concatenating strings for the IN clauses...
The ugly and difficult part of brestore was bulding the bsr file, and we've 
done it in perl only because we couldn't find a way of doing it directly in 
the director at that moment... I don't know if it's doable directly in the 
director now (telling him : i want this directory as seen by these jobids, 
and this file, etc ... and ask him to create a bsr that does all this in one 
go). It would be much better beacause it's duplicate code between the 
director and brestore, and brestore's code for this sucks :)


-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >>  http://get.splunk.com/
_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel

Re: [Bacula-devel] Query changes in the catalog browser and indexes

Reply via email to