On Sat, 2007-08-25 at 09:43 +0200, Marc Cousin wrote:
> On Saturday 25 August 2007 09:03:49 Kern Sibbald wrote:
> > Hello Marc,
> >
> > Now that we have a number of "browsers" and users are doing a lot more
> > queries, it might be a good idea to consider re-organizing a few of the
> > Bacula tables to improve the efficiency of the queries (and also perhaps
> > backups and restores).
> >
> > Please see more below...
> >
> > On Friday 24 August 2007 19:47, Marc Cousin wrote:
> > ...
> >
> > > > Thank you very much for your willingness to take a look at the indexes
> > > > I am proposing.
> > >
> > > I still have the same issues I discussed with you a few days ago : even
> > > with your new index, everything will look good as long as the SELECT
> > > DISTINCT returns only a very small amount of data (a few hundred to a few
> > > thousand records). It will be unbearable from a GUI point of view if you
> > > have to retrieve hundreds of thousands of directories at once (or even
> > > millions of directories, as the joblist takes the last 20 jobs, even if
> > > they are full jobs).
> > > So here we are talking hundreds of thousands of index scans/seeks for the
> > > database (because the dir entries are completely mixed with file entries
> > > in the File table, you probably won't have more than one/a few dirs per
> > > database page).
> > > My point is, if you really want to make it blazing fast (or at least
> > > reasonably fast), there are, I think, very few options:
> > > - The dir entries should not be drowned in the bulk of file entries (i.e.
> > > you need a new table to store them and only them, or you need to sort the
> > > table physically on disk so that directories are all stored together on a
> > > few contiguous pages...)
> >
> > I haven't been following this as carefully as I would normally, because of
> > my "mini-vacation", but I am interested in the basis of the problem as far
> > as the Bacula table structure.  I ask that because the Filenames and the
> > Paths are already separated into two tables, and that seems to be what you
> > are proposing.  Can you be a bit more explicit?  Is it that the File
> > entries should be split into two separate tables -- one containing a link
> > to Paths only and one containing a link to Path/Filenames?  If so, is that
> > something that you would propose for Bacula in general?
> 
> From my point of view, the main problem is that there is no efficient way 
> to get the list of subdirectories of a directory directly.
> Let's say, for instance, that you want to know which subdirectories of 
> /home exist on a server.
> You have to:
> - Get all paths from Path beginning with '/home' (with a LIKE query). This 
> will return ALL subdirectories of /home (/home/marc, /home/toto, but also 
> /home/marc/.kde/share/apps ...). Even if you do a regexp match, you will 
> still have to scan a big chunk of the table, as the index will match 
> everything beginning with /home.
> - Then you have to find each version of these PathIds in the File table, to 
> know if they should be visible or not (by JobId). So for the /home example, 
> let's say you are working in an environment with 100 workstations: you will 
> get every possible entry of /home that exists on each of these workstations 
> from the Path table, then look each one up in the File table to know if it 
> should be visible. When you have retrieved all these rows, you then have 
> two choices:
>       - Either you discard most of them (you just want to display /home, 
> you don't care about /home/marc/.kde right now), and you'll have to redo 
> nearly the same query in the next 10 seconds;
>       - Or you store everything you retrieve in memory. This can be 
> problematic too, of course: you may have millions of records to retrieve, 
> and very few of them will be used.
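The over-fetching Marc describes can be sketched against a miniature of the schema. This is purely illustrative: table and column names below are assumptions based on the thread, not Bacula's exact schema, and SQLite stands in for the real database.

```python
import sqlite3

# Miniature, hypothetical Path/File tables, to show why the LIKE lookup
# over-fetches (names are illustrative, not Bacula's exact schema).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Path (PathId INTEGER PRIMARY KEY, Path TEXT);
    CREATE TABLE File (FileId INTEGER PRIMARY KEY, PathId INTEGER, JobId INTEGER);
""")
con.executemany("INSERT INTO Path VALUES (?, ?)", [
    (1, "/home/"),
    (2, "/home/marc/"),
    (3, "/home/marc/.kde/share/apps/"),
    (4, "/home/toto/"),
])
con.executemany("INSERT INTO File (PathId, JobId) VALUES (?, ?)",
                [(1, 10), (2, 10), (3, 10), (4, 11)])

# The LIKE pattern matches EVERY descendant of /home, not just the direct
# children the GUI actually wants to display.
rows = con.execute("""
    SELECT DISTINCT Path.Path
    FROM Path JOIN File ON File.PathId = Path.PathId
    WHERE Path.Path LIKE '/home/%' AND File.JobId IN (10, 11)
""").fetchall()
paths = sorted(r[0] for r in rows)
```

With only four rows this is instant; with the millions of deep paths from 100 workstations, every one of them comes back from this query.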
> 
> I didn't even mention 'time navigation' (how to display the same tree from 
> a different backup's point of view).
> 
> I'll describe the solution we chose with Eric below, but here's what we 
> felt we needed:
> To retrieve only what's needed, and do that very fast, we need:
>       - To be able to retrieve the subdirectories of a directory very fast 
> (it means chaining them; for instance, one can add a parent_pathid column 
> to the Path table)
>       - To be able to retrieve the 'visibility' of paths. That's where the 
> second problem arose: a directory should be displayed if one of its 
> subdirectories is to be displayed. (For instance, I decided to back up 
> /home/marc and /home/eric, but not /home. /home should be displayed 
> nevertheless.)
> 
> We felt this should not be done during backup, as this work is not required 
> for anything but the GUI. But updating all the records of the File and Path 
> tables afterwards would lead to a great deal of fragmentation of these 
> tables. So we settled for what I describe below.
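The two requirements above can be sketched like this. The `Parent_PathId` column and the upward propagation are assumptions drawn from Marc's description, not the real brestore code, and a real implementation would do this set-based rather than row by row (as Marc notes later).

```python
import sqlite3

# Sketch of the two requirements: a parent link between paths, and
# visibility that propagates upward to ancestors.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Path (PathId INTEGER PRIMARY KEY, Path TEXT,
                       Parent_PathId INTEGER);
    CREATE TABLE pathvisibility (PathId INTEGER, JobId INTEGER);
""")
con.executemany("INSERT INTO Path VALUES (?, ?, ?)", [
    (1, "/home/",      None),  # never backed up itself...
    (2, "/home/marc/", 1),     # ...but these two are
    (3, "/home/eric/", 1),
])

def mark_visible(pathid, jobid):
    """A visible path makes all of its ancestors visible too.
    (Naive row-by-row version; duplicates would need deduplicating.)"""
    while pathid is not None:
        con.execute("INSERT INTO pathvisibility VALUES (?, ?)",
                    (pathid, jobid))
        (pathid,) = con.execute(
            "SELECT Parent_PathId FROM Path WHERE PathId = ?",
            (pathid,)).fetchone()

mark_visible(2, 10)  # backed up /home/marc in job 10
mark_visible(3, 10)  # backed up /home/eric in job 10

# /home was never backed up, yet it is now visible for job 10:
visible = {r[0] for r in con.execute(
    "SELECT PathId FROM pathvisibility WHERE JobId = 10")}
```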
> 
> >
> > > - Even better, you should retrieve as few entries as possible each
> > > time. It means being able to retrieve only the subdirectories of the
> > > current directory. This isn't feasible with the current schema either,
> > > as you need a way to link directories with subdirectories
> > > (bidirectionally if possible).
> >
> > I am also interested in understanding the tables that you and Eric use in
> > brestore to speed up retrieving multiple versions of a file.  Assuming it
> > is a good idea to split the current "File" table into "Dirs" and "Files",
> > do you think it would be a good idea to have the Bacula core build and
> > maintain any other tables for fast lookups?
> >
> > Another possible table would be a list of the "top" level Directories for
> > each Job.  That might significantly speed up retrieving the base directory
> > structure.
> 
> Here's what we've done:
> We created the following tables:
> - pathvisibility (pathid, jobid)
> - pathhierarchy (ppathid, pathid)
> - knownjobids (I don't remember the real name; Eric can correct me on 
> this :) )
> 
> - pathhierarchy gives us the link between directories and subdirectories.
> - pathvisibility tells us when we should display a pathid (for which jobs).
> Both of these tables are updated:
>       - By a cron job run every morning (it does this for every jobid not 
> in knownjobids). It can be quite fast if we do it on a big scale (several 
> jobids at once).
>       - When we miss the data (brestore does it if we ask for a jobid that 
> hasn't been calculated yet, i.e. it's not in knownjobids). Of course we try 
> to avoid that, but at least, if it's done, it's only done once. We do all 
> this using only 'big queries', not inserting records one by one.
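The set-based ("big query") update Marc describes might look roughly like the following sketch: seed visibility from the new jobs' File rows, then push it up the hierarchy one level per pass until a fixpoint. All table and column names are assumptions carried over from the thread, and SQLite stands in for the real database.

```python
import sqlite3

# Sketch of the batch update: compute visibility for every job not yet in
# knownjobids with set-based statements, instead of row-by-row inserts.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE File (PathId INTEGER, JobId INTEGER);
    CREATE TABLE pathhierarchy (PPathId INTEGER, PathId INTEGER);
    CREATE TABLE pathvisibility (PathId INTEGER, JobId INTEGER);
    CREATE TABLE knownjobids (JobId INTEGER);
""")
con.executemany("INSERT INTO File VALUES (?, ?)", [(2, 10), (3, 10)])
con.executemany("INSERT INTO pathhierarchy VALUES (?, ?)", [(1, 2), (1, 3)])

# 1. Seed visibility from the paths that actually appear in the new jobs.
con.execute("""
    INSERT INTO pathvisibility (PathId, JobId)
    SELECT DISTINCT PathId, JobId FROM File
    WHERE JobId NOT IN (SELECT JobId FROM knownjobids)
""")
# 2. Propagate visibility one level up at a time until nothing changes.
while True:
    cur = con.execute("""
        INSERT INTO pathvisibility (PathId, JobId)
        SELECT DISTINCT h.PPathId, v.JobId
        FROM pathvisibility v JOIN pathhierarchy h ON h.PathId = v.PathId
        WHERE NOT EXISTS (SELECT 1 FROM pathvisibility p
                          WHERE p.PathId = h.PPathId AND p.JobId = v.JobId)
    """)
    if cur.rowcount == 0:
        break
# 3. Remember that these jobs have been processed.
con.execute("INSERT INTO knownjobids SELECT DISTINCT JobId FROM File")

vis = sorted(con.execute("SELECT PathId, JobId FROM pathvisibility"))
```

Because each pass handles a whole level of the tree in one statement, several jobids can be processed together, which matches the "several jobids at once" point above.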

Just looking at this, I can tell it would be very easy for me to modify
the version browser to be considerably faster with these two tables.  I
agree with the suggestion of making top-level directories easier to
find, but I'm not sure it would NEED to be its own table.  Maybe
pathvisibility could have (pathid, jobid, isroot), with the table
indexed on isroot.  That would still allow for queries that would be
quite fast.
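The (pathid, jobid, isroot) idea might look like this; a sketch of the suggestion only, with hypothetical column names, not existing code.

```python
import sqlite3

# Sketch of Dirk's suggestion: fold the "top level" flag into
# pathvisibility instead of adding a separate table.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE pathvisibility (
        PathId INTEGER, JobId INTEGER, IsRoot INTEGER DEFAULT 0)
""")
con.execute("CREATE INDEX pv_isroot ON pathvisibility (IsRoot)")
con.executemany("INSERT INTO pathvisibility VALUES (?, ?, ?)", [
    (1, 10, 1),   # /home is a top-level entry for job 10
    (2, 10, 0),
    (3, 10, 0),
])

# Fetching the root directories of a job becomes an indexed lookup:
roots = [r[0] for r in con.execute(
    "SELECT PathId FROM pathvisibility WHERE JobId = ? AND IsRoot = 1",
    (10,))]
```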

These ideas I like.



So, these scripts that populate the tables for brestore: are they Perl
scripts run from cron jobs?

Dirk
> 
> When you have this, writing the client becomes quite easy:
> - For the selected date, you determine which jobids should be used (you 
> take all incrementals back to the last differential, that differential, 
> then the closest full)
> - Then for the selected path name, you get its pathid. With this, you get 
> all visible paths from pathvisibility/pathhierarchy (it rarely takes more 
> than 10ms), and all fileids from the File table that belong to the pathid 
> and are visible for the selected jobids.
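The per-click lookup Marc describes can be sketched as a single small join; again, schema names here are assumptions from the thread rather than the actual brestore code.

```python
import sqlite3

# Sketch of the browsing query: given the jobids chosen for a date and the
# pathid the user clicked, fetch only the visible direct subdirectories.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE pathhierarchy (PPathId INTEGER, PathId INTEGER);
    CREATE TABLE pathvisibility (PathId INTEGER, JobId INTEGER);
""")
con.executemany("INSERT INTO pathhierarchy VALUES (?, ?)",
                [(1, 2), (1, 3), (2, 4)])          # 4 is a grandchild of 1
con.executemany("INSERT INTO pathvisibility VALUES (?, ?)",
                [(1, 10), (2, 10), (4, 10), (3, 11)])

def visible_subdirs(pathid, jobids):
    """Direct children of `pathid` visible in any of the selected jobs."""
    marks = ",".join("?" * len(jobids))
    return [r[0] for r in con.execute(f"""
        SELECT DISTINCT h.PathId
        FROM pathhierarchy h
        JOIN pathvisibility v ON v.PathId = h.PathId
        WHERE h.PPathId = ? AND v.JobId IN ({marks})
        ORDER BY h.PathId
    """, (pathid, *jobids))]

# Only direct, visible children come back: 2 for job 10, not the
# grandchild 4, and not 3 (visible only for job 11).
children = visible_subdirs(1, [10])
```

Since the join touches only the compact pathhierarchy/pathvisibility tables, this is the query that "rarely takes more than 10ms" above.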
> 
> The only hack we had to do was our own pruning of jobs from the database: 
> we sometimes have to insert data into the Path table (create the /home path 
> in my example), and we didn't want it to be cleaned up afterwards. Of 
> course, with a GUI belonging to the main project, this wouldn't be a 
> problem; we could change the pruning algorithm :)
> 
> The main advantages we found in using separate tables were:
> - we can delay the display calculation until the main backups are done, so 
> the backups remain fast
> - the pathvisibility and pathhierarchy tables are very compact, so the 
> queries on them are very fast. And as they are not used all the time, we 
> can even CLUSTER them on PostgreSQL, so as to sort them physically by 
> ppathid on disk, making them even faster.
> 
> 
> I hope I've been clear enough; don't hesitate to ask me if that's not the 
> case :)
> BTW, I think that Eric will try to detail more precisely what we're doing 
> this weekend, using the code.
> 
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc.
> Still grepping through log files to find problems?  Stop.
> Now Search log events and configuration files using AJAX and a browser.
> Download your FREE copy of Splunk now >>  http://get.splunk.com/
> _______________________________________________
> Bacula-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/bacula-devel


