Hello Marc,

Thanks for the information. I think this is something that is probably well worth the effort to put into the core Bacula code. I'm going to study it a bit to see how it might be organized into a project. Code like this might also help a lot with the simple restores that Bacula currently does, by speeding up retrieval.
Perhaps Eric could send me the SQL that creates the tables so that I can see all the fields more clearly. See one minor note below ...

On Saturday 25 August 2007 09:43, Marc Cousin wrote:
> On Saturday 25 August 2007 09:03:49, Kern Sibbald wrote:
> > Hello Marc,
> >
> > Now that we have a number of "browsers" and users are doing a lot more queries, it might be a good idea to consider re-organizing a few of the Bacula tables to improve the efficiency of the queries (and also perhaps backups and restores).
> >
> > Please see more below...
> >
> > On Friday 24 August 2007 19:47, Marc Cousin wrote:
> > ...
> >
> > > > Thank you very much for your willingness to take a look at the indexes I am proposing.
> > >
> > > I still have the same issues I discussed with you a few days ago: even with your new index, everything will look good as long as the SELECT DISTINCT returns only a very small amount of data (a few hundred to a few thousand records). It will be unbearable from a GUI point of view if you have to retrieve hundreds of thousands of directories at once (or even millions of directories, since the job list takes the last 20 jobs, even if they are full jobs). So here we are talking about hundreds of thousands of index scans/seeks for the database (because the dir entries are completely mixed with the file entries in the File table, you probably won't have more than one or a few dirs per database page).
> > > My point is, if you really want to make it blazing fast (or at least reasonably fast), there are, I think, very few methods:
> > > - The dir entries should not be drowned in the bulk of the file entries (i.e. you need a new table to store them and only them, or you need to sort the table physically on disk so that directories are all stored together on a few contiguous pages...)
> >
> > I haven't been following this as carefully as I normally would, because of my "mini-vacation", but I am interested in the basis of the problem as far as the Bacula table structure is concerned. I ask because the Filenames and the Paths are already separated into two tables, and that seems to be what you are proposing. Can you be a bit more explicit? Is it that the File entries should be split into two separate tables -- one containing a link to Paths only and one containing a link to Path/Filenames? If so, is that something that you would propose for Bacula in general?
>
> From my point of view, the main problem is that there is no efficient way to directly get the list of subdirectories of a directory.
> Let's say, for instance, that you want to know which subdirectories are in /home on a given server. You have to:
> - Get all paths from Path beginning with '/home' (with a LIKE query). This will return ALL subdirectories of /home (/home/marc, /home/toto, but also /home/marc/.kde/share/apps ...). Even if you do a regexp match, you will still have to scan a big chunk of the table, as the index will match everything beginning with /home.
> - Then you have to find each version of these PathIds in the File table, to know whether they should be visible or not (by JobId). So for the /home example, let's say you are working in an environment with 100 workstations: you will get, from the Path table, every possible entry of /home that exists on each of these workstations, and then look each one up in the File table to know whether it should be visible.
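To see the cost, here is roughly the query that the current schema forces, sketched against the stock Path and File tables. This is an illustration only, not brestore's code; the JobId list is a hypothetical stand-in for the jobs being browsed:

    -- Every path under /home, however deep, joined against File to
    -- decide which of them actually exist in the selected jobs:
    SELECT DISTINCT Path.PathId, Path.Path
      FROM Path
      JOIN File ON File.PathId = Path.PathId
     WHERE Path.Path LIKE '/home/%'
       AND File.JobId IN (101, 102, 103);   -- hypothetical JobIds
    -- The LIKE pattern matches /home/marc/ but also
    -- /home/marc/.kde/share/apps/, so with 100 workstations this can
    -- return (and scan) hundreds of thousands of rows, almost all of
    -- which the GUI immediately throws away.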
> When you have retrieved all these rows, you then have two choices: either you discard most of them (you just want to display /home, you don't care about /home/marc/.kde right now), and then you'll have to redo nearly the same query within the next 10 seconds, or you store everything you retrieve in memory. That can be problematic too, of course: you may have millions of records to retrieve, and very few of them will ever be used.
>
> I didn't even mention 'time navigation' (how to display the same backup tree as seen from a different point in time).
>
> I'll describe the solution we chose with Eric below, but here's what we felt we needed. To be able to retrieve only what's needed, and to do it very fast, we need:
> - To be able to retrieve the subdirectories of a directory very fast (this means chaining them; for instance, one can add a parent_pathid to the Path table)
> - To be able to retrieve the 'visibility' of paths. That's where the second problem arose: a directory should be displayed if one of its subdirectories is to be displayed (for instance, I decided to back up /home/marc and /home/eric, but not /home; /home should be displayed nevertheless).
>
> We felt this should not be done during the backup, as it is work that is not required for anything but the GUI. But updating all the records of the File and Path tables afterwards would lead to a great deal of fragmentation of these tables. So we settled for what I describe below.
>
> > > - Even better, you could retrieve as few entries as possible each time. That means being able to retrieve only the subdirectories of the current directory. This isn't feasible with the current schema either, as you need a way to link directories with subdirectories (bidirectionally if possible).
> >
> > I am also interested in understanding the tables that you and Eric use in brestore to speed up retrieving multiple versions of a file. Assuming it is a good idea to split the current "File" table into "Dirs" and "Files", do you think it would be a good idea to have the Bacula core build and maintain any other tables for fast lookups?
> >
> > Another possible table would be a list of the "top" level directories for each Job. That might significantly speed up retrieving the base directory structure.
>
> Here's what we've done. We created these tables:
> - pathvisibility (pathid, jobid)
> - pathhierarchy (ppathid, pathid)
> - knownjobids (I don't remember the real name, Eric can correct me on this :) )
>
> pathhierarchy gives us the link between directories and subdirectories; pathvisibility tells us when we should display a pathid (i.e., for which jobs). Both of these tables are updated:
> - By a cron job run every morning (it does this for every jobid not in knownjobids). It can be quite fast if we do it on a big scale (several jobids at once).
> - On demand, when we miss the data (brestore does it if we ask for a jobid that hasn't been calculated, i.e. one that is not in knownjobids). Of course we try to avoid that, but at least, when it happens, the work is only done once. We do all this using only 'big queries', never inserting records one by one.
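For concreteness, here is a minimal sketch of what these tables and the nightly batch update might look like. This is hypothetical SQL built from Marc's description, not brestore's actual code; in particular, 'knownjobids' is the name Marc was unsure of, and filling pathhierarchy itself (deriving each path's parent from the path string) is omitted:

    -- Helper tables; column names follow Marc's description.
    CREATE TABLE pathhierarchy (
        pathid  integer NOT NULL PRIMARY KEY,  -- a directory (references Path)
        ppathid integer NOT NULL               -- its parent directory
    );
    CREATE INDEX pathhierarchy_ppathid ON pathhierarchy (ppathid);

    CREATE TABLE pathvisibility (
        pathid integer NOT NULL,               -- show this directory...
        jobid  integer NOT NULL,               -- ...when browsing this job
        PRIMARY KEY (pathid, jobid)
    );

    CREATE TABLE knownjobids (
        jobid integer NOT NULL PRIMARY KEY     -- jobs already processed
    );

    -- The batch update, run in one transaction.  Step 1: mark every
    -- path directly referenced by a not-yet-processed job as visible.
    INSERT INTO pathvisibility (pathid, jobid)
    SELECT DISTINCT File.PathId, File.JobId
      FROM File
     WHERE File.JobId NOT IN (SELECT jobid FROM knownjobids);

    -- Step 2: propagate visibility to parent directories; repeat until
    -- no more rows are inserted (this is what makes /home visible when
    -- only /home/marc was backed up).
    INSERT INTO pathvisibility (pathid, jobid)
    SELECT DISTINCT h.ppathid, v.jobid
      FROM pathvisibility v
      JOIN pathhierarchy h ON h.pathid = v.pathid
     WHERE NOT EXISTS (SELECT 1 FROM pathvisibility p
                        WHERE p.pathid = h.ppathid AND p.jobid = v.jobid);

    -- Step 3: remember which jobs have been processed.
    INSERT INTO knownjobids (jobid)
    SELECT DISTINCT JobId FROM File
     WHERE JobId NOT IN (SELECT jobid FROM knownjobids);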
> When you have this, writing the client becomes quite easy:
> - For the selected date, you determine which jobids should be used (you take all the incrementals back to the last differential, that differential, and then the closest full).
> - Then, for the selected path name, you get its pathid. With this, you get all the visible paths from pathvisibility/pathhierarchy (it rarely takes more than 10 ms), and all the fileids from the File table that belong to that pathid and are visible for the selected jobids.
>
> The only hack we had to do was our own pruning of jobs from the database: we sometimes have to insert data into the Path table (to create the /home path in my example), and we didn't want it to be cleaned up afterwards.

Hmm. None of the entries in the Path or Filename tables will ever be pruned unless all references to them are pruned AND you run a program like dbcheck and specifically ask it to remove orphans. This is one part of pruning that Bacula does not do, because it is *extremely* time consuming. Not doing this kind of pruning is generally not a problem, but it could be if you have large numbers of files that are created and deleted (certain "ill-behaved" email programs).

Regards,

Kern

> Of course, with a GUI belonging to the main project, this wouldn't be a problem; we could change the pruning algorithm :)
>
> The main advantages we saw in using separate tables were that:
> - we can delay the display calculation until after the main backups are done, so the backups remain fast;
> - the pathvisibility and pathhierarchy tables are very compact, so the queries on them are very fast. And as they are not used all the time, we can even 'CLUSTER' them on PostgreSQL, sorting them physically on disk by ppathid, to make them even faster.
>
> I hope I've been clear enough; don't hesitate to ask me if that's not the case :)
> BTW, I think that Eric will try to detail more precisely what we're doing this weekend, using the code.
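To make the browse step above concrete: with the helper tables sketched earlier, opening a directory comes down to two small queries, plus the optional physical clustering Marc mentions. This is a hypothetical sketch, not brestore's actual code; the placeholders :pathid and :jobids stand for the directory being opened and the jobids chosen for the selected date:

    -- Visible subdirectories of the current directory:
    SELECT DISTINCT Path.PathId, Path.Path
      FROM pathhierarchy h
      JOIN pathvisibility v ON v.pathid = h.pathid
      JOIN Path ON Path.PathId = h.pathid
     WHERE h.ppathid = :pathid
       AND v.jobid IN (:jobids);

    -- Files stored directly in the current directory:
    SELECT File.FileId, Filename.Name, File.JobId
      FROM File
      JOIN Filename ON Filename.FilenameId = File.FilenameId
     WHERE File.PathId = :pathid
       AND File.JobId IN (:jobids);

    -- Optional, PostgreSQL-specific: sort pathhierarchy physically by
    -- parent id, as Marc describes (on 8.3+ the syntax is
    -- CLUSTER table USING index; older versions use CLUSTER index ON table):
    CLUSTER pathhierarchy USING pathhierarchy_ppathid;

Because pathhierarchy and pathvisibility hold one row per directory rather than one per file version, both lookups touch only a handful of pages, which is consistent with the roughly 10 ms figure Marc quotes.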
