Re: [Bacula-devel] Query changes in the catalog browser and indexes

Marc Cousin Sat, 25 Aug 2007 14:24:26 -0700

On Saturday 25 August 2007 22:25:00 Dan Langille wrote:
> On 25 Aug 2007 at 22:07, Marc Cousin wrote:
> > On Saturday 25 August 2007 21:34:25 Dan Langille wrote:
> > > On 25 Aug 2007 at 10:51, Eric Bollengier wrote:
> > > > On Saturday 25 August 2007 10:31:04 you wrote:
> > > > > Hello Marc,
> > > > >
> > > > > Thanks for the information.  I think this is something that is
> > > > > probably well worth the effort to put in the core Bacula code.  I'm
> > > > > going to study it a bit to see how it might be organized into a
> > > > > project.  Code like this might also help a lot for the simple
> > > > > restores that Bacula currently does by speeding up the retrieval.
> > > > >
> > > > > Perhaps Eric could send me the SQL that creates the tables so that
> > > > > I can see more clearly all the fields.
> > > >
> > > > Sql definitions :
> > > >
> > > >     CREATE TABLE brestore_knownjobid
> > > >     (
> > > >      JobId int4 NOT NULL,
> > > >      CONSTRAINT brestore_knownjobid_pkey PRIMARY KEY (JobId)
> > > >     );
> > > >
> > > >    CREATE TABLE brestore_pathhierarchy
> > > >    (
> > > >      PathId int4 NOT NULL,
> > > >      PPathId int4 NOT NULL,
> > > >      CONSTRAINT brestore_pathhierarchy_pkey PRIMARY KEY (PathId)
> > > >    );
> > > >
> > > >    CREATE INDEX brestore_pathhierarchy_ppathid
> > > >                           ON brestore_pathhierarchy (PPathId);
> > > >
> > > >     CREATE TABLE brestore_pathvisibility
> > > >     (
> > > >       PathId int4 NOT NULL,
> > > >       JobId int4 NOT NULL,
> > > >       Size int8 DEFAULT 0,                -- used for statistics
> > > >       Files int4 DEFAULT 0,               -- used for statistics
> > > >       CONSTRAINT brestore_pathvisibility_pkey PRIMARY KEY (JobId,
> > > > PathId) );
> > > >
> > > >     CREATE INDEX brestore_pathvisibility_jobid
> > > >                           ON brestore_pathvisibility (JobId);
> > > >
> > > >
> > > > For example, this is how we list files in a directory :
> > > >
> > > > SELECT File.FilenameId, listfiles.id, listfiles.Name, File.LStat,
> > > > File.JobId FROM
> > > >         (SELECT Filename.Name, max(File.FileId) as id
> > > >          FROM File, Filename
> > > >          WHERE File.FilenameId = Filename.FilenameId
> > > >            AND Filename.Name != ''
> > > >            AND File.PathId IN (     10,20,20202,20202        )
> > > >            AND File.JobId IN (        1,2,3,4           )
> > > >          GROUP BY Filename.Name
> > > >          ORDER BY Filename.Name) AS listfiles,
> > > > File
> > > > WHERE File.FileId = listfiles.id
> > > >
> > > > I think, i will wrote a new class (something like Bvfs) which will
> > > > do basic operations like in a real filesystem.
> > > >
> > > > ch_dir(pathid)
> > > >         Change current directory to pathid
> > > > ls_dirs()
> > > >         List all directories in the current directory (pathid and jobid)
> > > > ls_files()
> > > >         List all files in a the current directory (pathid and jobid)
> > > > up_dir()
> > > >         Change to parent directory
> > > > pwd()
> > > >         Get the current pathid
> > > > get_pathid(path)
> > > >         Return pathid from a given path
> > > > get_root()
> > > >         Get root pathid
> > > > get_all_file_versions(...)
> > > >         Get all versions for a file
> > >
> > > Can you explain the pros and cons over something like a parent-child
> > > relationship?  That each, each file entry has a link to the parent
> > > directory.  This would be a self referential table, with some entries
> > > being files, some being directories.  This is what I use for the
> > > FreshPorts database.
> > >
> > > See http://news.freshports.org/2007/08/03/freshports-database-primer/
> >
> > The pro is that you can get the parent or the children of a directory
> > directly using an index. it's something you cannot do with only the path.
>
> This is possible with a parent-child relationship.  You seem to say
> it is not.  I think we may be referring to different things.  The
> table structure I am referring to is something like this:
>
>     file (id, parent_id)
>
> where parent_id is another entry in the file table.


the pathhierarchy we're talking about is exactly this, focused on directories. 
That's what we're using right now with brestore. The difficulty here is that 
a path can be the subdirectory of another path on a machine (for a certain 
jobid...) and not of another one. hence the pathvisibility table (tells us 
which path should be displayed for which jobid)

>
> > this relation already exists between dirs and files (in the file table),
> > but not between dirs and subdirs
>
> In the Bacula file table as it stands now?  I don't think so.  This
> really makes me think we are referring to two different things.
>
> The reason I raise my questions has to do with me thinking we don't
> need additional tables.  Just additional relationships, and no need
> for a cron job to update tables.


In the File table, you have the Pathid. You don't really have the 
bi-directionnal relationship, of course, but it has no real meaning for a 
file. A directory contains files, a file does not contain directory. What I 
meant is that we already have all we need to display the files contained in a 
directory. What we lack is the directories contained in a directory. We can 
get it with a LIKE 'parentpath%' or a regexp, but it is very ineficient. 
That's why we created our 2 brestore tables. We also wanted them to be very 
small, and duplicate as little data as possible.

We can do it with a supplemental column in Path for instance. But I think it 
will be inneficient : an update for a very big amount of records is slow, and 
may fragment records for certain databases, or take temporarily twice the 
amount of space during the update (for postgres). What I mean there is just 
that I'm not sure updating all the records we've just inserted won't lead to 
performance problems during the update or after it.
For postgresql, it will create double the records (temporarily).
For mysql, if you use myisam, the update will (probably) be in place, but 
you'll lock the file table (and degrade any other job working). If you use 
innodb, I don't know how it will behave. Is it multiversion like postgres ? 
Or does it use a sort of undo space like oracle ? Both cases are 
problematic..
Moreover, from the database point of view, it means you first 
insert 'incomplete' records in the database, then you 'tidy' them up. It 
seems a bit weird to me : either the record is in the file table and it's 
good, or it's not in the table at all...

The advantage of dedicated tables is that they do not interfere with the 
current code, are very compact, and can be rearranged without ever locking 
the file table. And we keep 'display only' data out of the main tables. But 
it feels less 'pure' from a relational point of view.

Of course, it's just my point of view :)

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >>  http://get.splunk.com/
_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel

Re: [Bacula-devel] Query changes in the catalog browser and indexes

Reply via email to