Re: [Bacula-devel] Query changes in the catalog browser and indexes

Marc Cousin Sun, 02 Sep 2007 05:46:53 -0700

> Actually, it is worse than what you describe above.  The problem is that I
> did not consider in my proposal what happens in an Incremental backup.  As
> proposed, it simply will not work.
>
> > Maybe it would be easier to add a parentid in the Path table.
>
> This is probably a nice solution that will help improve performance a lot.


We would still have to work out a way of inserting this data efficiently: I
think we should not do single selects to find the parent path of each path,
as that would defeat everything we've done for batch inserts. There are ways
to do it in batch; it will just require some implementation work.
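
As a rough illustration of one such batch approach (a sketch only: the
ParentId column, the temporary "batch" table and the helper names are
assumptions, directory paths are assumed to end in '/', and SQLite is used
just to keep the example self-contained). Ancestors are computed in memory,
then inserted with one INSERT ... SELECT per depth level, so there is neither
a SELECT per path nor an UPDATE afterwards:

```python
import sqlite3


def parent_of(path):
    """Parent of a directory path that ends in '/', or None for a root."""
    stripped = path.rstrip("/")
    if "/" not in stripped:
        return None  # '/' on Unix, 'C:/' on Windows
    return stripped.rsplit("/", 1)[0] + "/"


def make_catalog():
    """Toy catalog: a Path table with a hypothetical ParentId column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Path (PathId INTEGER PRIMARY KEY,"
        " Path TEXT UNIQUE NOT NULL, ParentId INTEGER)"
    )
    return conn


def batch_insert_paths(conn, paths):
    """Insert paths and their missing ancestors without per-path SELECTs."""
    # Expand the batch with every ancestor, purely in memory.
    todo = set()
    for p in paths:
        while p is not None and p not in todo:
            todo.add(p)
            p = parent_of(p)
    depth_of = {p: p.count("/") - 1 for p in todo}

    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS batch"
        " (Path TEXT PRIMARY KEY, Parent TEXT, Depth INTEGER)"
    )
    conn.execute("DELETE FROM batch")
    conn.executemany(
        "INSERT INTO batch VALUES (?, ?, ?)",
        [(p, parent_of(p), d) for p, d in depth_of.items()],
    )
    # Shallowest level first, so every parent row already exists when its
    # children are inserted and ParentId can be resolved by a join.
    for depth in range(max(depth_of.values(), default=-1) + 1):
        conn.execute(
            "INSERT INTO Path (Path, ParentId)"
            " SELECT b.Path, p.PathId FROM batch b"
            " LEFT JOIN Path p ON p.Path = b.Parent"
            " WHERE b.Depth = ? AND b.Path NOT IN (SELECT Path FROM Path)",
            (depth,),
        )
```

The number of statements grows with the depth of the deepest path in the
batch, not with the number of paths.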

>
> > Of course it
> > means we don't restrain links to the ones that should be displayed for a
> > given server... But this is then easily filtered matching data from the
> > Dir or File table and saves a lot of space. Of course, it defeats the
> > purpose of having an easy way to recognise 'root' directories, as the
> > info isn't there anymore... Maybe then this info should be stored in
> > another place ? Something like having more metadata in the job table (or
> > another table describing all the root directories associated with a
> > particular job, or anything of this sort, I really don't know)
>
> Yes, I think we will need some new table.  If this interests you, I would
> be interested in what you could come up for a proposal.  The problem I have
> in designing this is that I have never written code like your brestore, so
> I probably would not get the table structure right the first time (missing
> minor points).

There are at least three ways of doing it:
- we do a 'brute' insertion of the job's characteristics into, say, a big
varchar or blob, and we parse it at runtime from the GUI and other programs.
It means that if we change the syntax of the configuration files between
releases, all programs need to be able to understand several representations.
- we do a meta-schema for these (I don't like it either, I feel it's just the
same as the previous one).
- we go for 'simplicity': we store only what's useful from the job's
configuration. If we need some more data later, we'll store it, but for now,
what I see us needing is the root directories. It's the way I think would be
best: should you go for an XML configuration file (it's just an example, I
beg you not to do this :) ), we'd just have to convert the storing part, not
the rest.

If you agree we have to go for this third one, I think we first need to define 
what we want to store for a job. Except for what we already have in the job 
table, I can only see the list of root directories for now.
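
A minimal sketch of what this third option could look like if only the root
directories are stored. The JobRoot table, its columns and the helper names
are hypothetical, the Path table is reduced to two columns, and SQLite (with
its INSERT OR IGNORE syntax) is used only to keep the example runnable:

```python
import sqlite3


def create_schema(conn):
    # Minimal stand-in for the existing Path table.
    conn.execute(
        "CREATE TABLE Path (PathId INTEGER PRIMARY KEY,"
        " Path TEXT UNIQUE NOT NULL)"
    )
    # Hypothetical new table: one row per (job, top level directory).
    conn.execute(
        "CREATE TABLE JobRoot (JobId INTEGER NOT NULL,"
        " PathId INTEGER NOT NULL REFERENCES Path(PathId),"
        " PRIMARY KEY (JobId, PathId))"
    )


def record_roots(conn, jobid, roots):
    """Store the top level directories of one job, in batch."""
    conn.executemany(
        "INSERT OR IGNORE INTO Path (Path) VALUES (?)",
        [(r,) for r in roots],
    )
    conn.executemany(
        "INSERT OR IGNORE INTO JobRoot (JobId, PathId)"
        " SELECT ?, PathId FROM Path WHERE Path = ?",
        [(jobid, r) for r in roots],
    )


def roots_for_job(conn, jobid):
    """Return the top level directories recorded for a job, sorted."""
    cur = conn.execute(
        "SELECT p.Path FROM JobRoot r JOIN Path p ON p.PathId = r.PathId"
        " WHERE r.JobId = ? ORDER BY p.Path",
        (jobid,),
    )
    return [row[0] for row in cur]
```

The cost is a handful of rows per job, whatever the number of directories
backed up.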

>
> Now is probably the time to implement the tables, even if we cannot
> implement all the Bacula core code necessary, since the next release of
> Bacula will probably be around the end of the year, and will be version
> 3.0.0.  That will be the first version free of the OpenSSL license
> constraints, and it would be a good time to make any database
> reorganization such as adding a new field to the Path table and adding new
> tables.

Got that... but one of Bacula's current strengths is that the schema is
reasonably simple and compact and gives good backup performance (which I
think is the most important). We have to be careful not to compromise that,
or at least not too much.

>
> > Having a table/set of tables describing precisely how a backup was done
> > may be very interesting compared to storing a big amount of useless data
> > in these tables : it seems better paying a fixed amount on saving full
> > metadata of each backup than waste 4 bytes per dir to save the parentid
> > of every dir we back up ?
>
> Yes, my intention was never to store a whole lot of extra data, and
> certainly, we need to be careful in doing Diff and Inc backups that we
> don't store a whole directory tree.
>
> On the other side of that, you mentioned somewhere that you didn't see the
> need for the PathId in the File record if we have a Dirs table (which I
> still think is a good idea since it separates Files and Directories).  The
> reason I kept the PathId was two fold:
>
> 1. It means that none of the existing code that uses PathId needs to be
> changed.  That doesn't mean that we cannot change it if we find better
> mechanisms, but it takes some of the pressure off.
>
> 2. I don't like multiple SQL links.  I.e. I don't much like that to get the
> Path, from a File record that you have to do a lookup in the Dirs table and
> then a lookup in the Path table.

From a performance point of view, you are right, it's obviously better to do
one join than two. The problem is that we would then have two ways of finding
the path in which a file is stored: through the dir table, and directly
through the path table.
I'd say we drop the link between the dir and file tables: a path doesn't
really belong to a version of a backed up directory, the file belongs to a
path... the path being, in fact, only a way of not storing the full path of
the file for each backup...

>
> Actually, in my proposal, I didn't think that adding a DirsId was really
> necessary in the File table, but it is probably desirable.
Can you explain why? I don't see the point of having this link.

>
> Given that my proposal did not take into account Diff and Inc backups, I
> think it needs to be totally reworked.  Ideas that I like:
>
> 1. Separate the current File table into Files and Dirs.
> 2. Add a ParentId link in the Path table.
> 3. Add a new table (or two) that provides the rest of the information that
>     a "browser" needs for efficient lookups.
> 4. Try to design it so that most if not all the entries can be created
> during backup.
> 5. Try to design it so that users have some flexibility (probably via
> Bacula scripts) as to which indexes are created.  I.e. if they do not
> browse maybe some indexes can be eliminated, and if someone browses a lot
> then have some scripts that will easily add the indexes for the user.
>



> > > > In the Files table, in *addition* to the existing columns (Path and
> > > > Filename), if we need it we can have a DirId, which points to the
> > > > Dirs record for the given Path and Filename.
> > >
> > > If we have the dirid, we don't need the pathid anymore, I guess, as it
> > > would be in the dir table.
> >
> > Then comes another doubt :)
> > What happens if the dirid isn't there anymore ? (we have made an
> > incremental backup of a file, and the full it refers to doesn't exist
> > anymore)
> >
> > > > To do the above, we
> > > > 1. Split the File table into Dirs and Files
> > > > 2. Add one new column to Files, which is DirId (if necessary)
> > > > 3. Delete the FilenameId from the Dirs record (i.e. it is identical
> > > > to the current File record less the FilenameId column).
> > > > 4. Add one new ParentId column to the Dirs table.
> > >
> > > Here I've got a question : will you calculate the ParentId at insert
> > > time ? (we must avoid updates, it has a big performance impact on all
> > > transactional DBMSs). I really don't know how much it will cost, but it
> > > may slow down database insertions by a big amount... The parentid dir
> > > may not even be in the same job...
> > >
> > > The point of brestore's method is to calculate as little of these links
> > > as possible, thanks to the hierarchy table : the links between dirs and
> > > their parents are not correlated with jobs in our case, so we do the
> > > links once for all the jobs, except in the case of a new directory.
> >
> > If we store parentpathid in the path table, it becomes much less costly...
> > but then again, we don't have the root directories information at hand
> > anymore.
>
> I think having the top level directories information is important.  There
> is probably no reason why we cannot add it for each JobId.  I'm sure you
> are fully aware of it, but some developers may not be, but such a table
> would need to have multiple values for the top level directory for a given
> JobId. For Linux, all the top level directories are ultimately accessible
> through /, but for Windows, there can actually be multiple roots.

I think that if we have:
- the top level directories for each jobid
- a way to get all potential path entries inside a chosen path in one query

we can already work out which of these paths are to be displayed:
- For a given jobid, we must display:
        - All the paths that have an entry, either in the File table if we
          don't split it or in the Dir table if we split. For these we display
          the metadata (mtime etc...)
        - All parent paths of the top level directories of this jobid (for
          instance, if we are in / and /home/marc has been backed up, /home
          must be displayed). For these, we don't display metadata...

If we want to do full time navigation as with brestore, this becomes:
We still retrieve all potential paths from the path table, then
- For a given list of jobids, we must display:
        - The latest version of all paths found from path that have an entry
          in file or dir
        - All parent paths of the top level directories of these jobids. For
          these, we don't display metadata...
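
The two display rules above can be sketched in a few lines. This is only an
illustration under assumptions: the function and argument names are made up,
the data is passed in as plain Python structures instead of being queried
from the catalog, directory paths end in '/', and "latest" simply means the
last jobid in the list that has an entry for the path:

```python
def visible_children(candidates, jobids, entries, roots):
    """Decide which child paths of the browsed directory to display.

    candidates: paths directly inside the browsed directory (from Path)
    jobids:     jobs to consider, oldest first (a single job is a 1-item list)
    entries:    {(jobid, path): metadata} for paths present in File/Dir
    roots:      {jobid: set of top level directories of that job}
    Returns a sorted list of (path, metadata), metadata being None for
    paths shown only because they lead to a top level directory.
    """
    all_roots = set()
    for jobid in jobids:
        all_roots |= roots.get(jobid, set())

    shown = []
    for path in sorted(candidates):
        # Rule 1: the path was backed up, show its latest version.
        latest = next(
            (j for j in reversed(jobids) if (j, path) in entries), None
        )
        if latest is not None:
            shown.append((path, entries[(latest, path)]))
        # Rule 2: the path is a strict parent of a top level directory.
        elif any(r != path and r.startswith(path) for r in all_roots):
            shown.append((path, None))
    return shown
```

With /etc/ and /home/marc/ as top level directories of job 1, browsing /
shows /etc/ with its metadata and /home/ without any, while a candidate such
as /var/ stays hidden.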

The only win of splitting file and dir for this algorithm is that dir will be 
smaller, so the queries will be even faster...

If we have stored, for each jobid, all the top level directories, we don't 
need the pathvisibility table anymore. Its purpose was to rebuild all this
missing data from what had been backed up.

>
> As noted above, if we can come up with a new table structure in say the
> next month, I believe that we can get it into the 3.0.0 release, even if we
> don't have all the backend code implemented.  Once the table structure is
> implemented (i.e. in 3.0.0), we can if necessary add additional
> functionality in subreleases (i.e. start using the new columns).

