Hello Marc,

I don't yet understand the details of how you use these tables, but it seems 
to me that we could modify the Bacula table structure as follows:

- Split the File table into two new tables:
    - Files containing only file entries
    - Dirs containing the equivalent of a current File entry for
        each directory, except that it would have a pointer to the
        parent Path entry, and possibly a pointer to the parent Dirs
        entry (however that entry may not exist in each backup).
        If necessary we could add the visibility flag -- I don't see its use
        yet.

This would separate the Bacula File and Directory entries making for a lot of 
efficiency when one just wants the directory structure.  It is something that 
Bacula could easily do (I think) when inserting the records, and we can add a 
few additional links.  The fact of splitting the File table in two means that 
the new Files table no longer needs the PathId link, which could be reused to 
point to the Dirs table.  Thus we would be adding very little extra data 
space to the tables.  Of course, there would be new index files but that 
happens in any case.

Do you think this could work out?

By the way, I will be gone for the next two days, so I won't be able to 
respond until Tuesday ...

Best regards,

Kern


On Saturday 25 August 2007 23:24, Marc Cousin wrote:
> On Saturday 25 August 2007 22:25:00 Dan Langille wrote:
> > On 25 Aug 2007 at 22:07, Marc Cousin wrote:
> > > On Saturday 25 August 2007 21:34:25 Dan Langille wrote:
> > > > On 25 Aug 2007 at 10:51, Eric Bollengier wrote:
> > > > > On Saturday 25 August 2007 10:31:04 you wrote:
> > > > > > Hello Marc,
> > > > > >
> > > > > > Thanks for the information.  I think this is something that is
> > > > > > probably well worth the effort to put in the core Bacula code. 
> > > > > > I'm going to study it a bit to see how it might be organized into
> > > > > > a project.  Code like this might also help a lot for the simple
> > > > > > restores that Bacula currently does by speeding up the retrieval.
> > > > > >
> > > > > > Perhaps Eric could send me the SQL that creates the tables so
> > > > > > that I can see more clearly all the fields.
> > > > >
> > > > > Sql definitions :
> > > > >
> > > > >     CREATE TABLE brestore_knownjobid
> > > > >     (
> > > > >      JobId int4 NOT NULL,
> > > > >      CONSTRAINT brestore_knownjobid_pkey PRIMARY KEY (JobId)
> > > > >     );
> > > > >
> > > > >    CREATE TABLE brestore_pathhierarchy
> > > > >    (
> > > > >      PathId int4 NOT NULL,
> > > > >      PPathId int4 NOT NULL,
> > > > >      CONSTRAINT brestore_pathhierarchy_pkey PRIMARY KEY (PathId)
> > > > >    );
> > > > >
> > > > >    CREATE INDEX brestore_pathhierarchy_ppathid
> > > > >                           ON brestore_pathhierarchy (PPathId);
> > > > >
> > > > >     CREATE TABLE brestore_pathvisibility
> > > > >     (
> > > > >       PathId int4 NOT NULL,
> > > > >       JobId int4 NOT NULL,
> > > > >       Size int8 DEFAULT 0,                -- used for statistics
> > > > >       Files int4 DEFAULT 0,               -- used for statistics
> > > > >       CONSTRAINT brestore_pathvisibility_pkey PRIMARY KEY (JobId,
> > > > > PathId) );
> > > > >
> > > > >     CREATE INDEX brestore_pathvisibility_jobid
> > > > >                           ON brestore_pathvisibility (JobId);
> > > > >
> > > > >
> > > > > For example, this is how we list files in a directory :
> > > > >
> > > > > SELECT File.FilenameId, listfiles.id, listfiles.Name, File.LStat,
> > > > > File.JobId FROM
> > > > >       (SELECT Filename.Name, max(File.FileId) as id
> > > > >        FROM File, Filename
> > > > >        WHERE File.FilenameId = Filename.FilenameId
> > > > >          AND Filename.Name != ''
> > > > >          AND File.PathId IN (     10,20,20202,20202        )
> > > > >          AND File.JobId IN (        1,2,3,4           )
> > > > >        GROUP BY Filename.Name
> > > > >        ORDER BY Filename.Name) AS listfiles,
> > > > > File
> > > > > WHERE File.FileId = listfiles.id
> > > > >
> > > > > I think, i will wrote a new class (something like Bvfs) which will
> > > > > do basic operations like in a real filesystem.
> > > > >
> > > > > ch_dir(pathid)
> > > > >       Change current directory to pathid
> > > > > ls_dirs()
> > > > >       List all directories in the current directory (pathid and jobid)
> > > > > ls_files()
> > > > >       List all files in a the current directory (pathid and jobid)
> > > > > up_dir()
> > > > >       Change to parent directory
> > > > > pwd()
> > > > >       Get the current pathid
> > > > > get_pathid(path)
> > > > >       Return pathid from a given path
> > > > > get_root()
> > > > >       Get root pathid
> > > > > get_all_file_versions(...)
> > > > >       Get all versions for a file
> > > >
> > > > Can you explain the pros and cons over something like a parent-child
> > > > relationship?  That each, each file entry has a link to the parent
> > > > directory.  This would be a self referential table, with some entries
> > > > being files, some being directories.  This is what I use for the
> > > > FreshPorts database.
> > > >
> > > > See http://news.freshports.org/2007/08/03/freshports-database-primer/
> > >
> > > The pro is that you can get the parent or the children of a directory
> > > directly using an index. it's something you cannot do with only the
> > > path.
> >
> > This is possible with a parent-child relationship.  You seem to say
> > it is not.  I think we may be referring to different things.  The
> > table structure I am referring to is something like this:
> >
> >     file (id, parent_id)
> >
> > where parent_id is another entry in the file table.
>
> the pathhierarchy we're talking about is exactly this, focused on
> directories. That's what we're using right now with brestore. The
> difficulty here is that a path can be the subdirectory of another path on a
> machine (for a certain jobid...) and not of another one. hence the
> pathvisibility table (tells us which path should be displayed for which
> jobid)
>
> > > this relation already exists between dirs and files (in the file
> > > table), but not between dirs and subdirs
> >
> > In the Bacula file table as it stands now?  I don't think so.  This
> > really makes me think we are referring to two different things.
> >
> > The reason I raise my questions has to do with me thinking we don't
> > need additional tables.  Just additional relationships, and no need
> > for a cron job to update tables.
>
> In the File table, you have the Pathid. You don't really have the
> bi-directionnal relationship, of course, but it has no real meaning for a
> file. A directory contains files, a file does not contain directory. What I
> meant is that we already have all we need to display the files contained in
> a directory. What we lack is the directories contained in a directory. We
> can get it with a LIKE 'parentpath%' or a regexp, but it is very
> ineficient. That's why we created our 2 brestore tables. We also wanted
> them to be very small, and duplicate as little data as possible.
>
> We can do it with a supplemental column in Path for instance. But I think
> it will be inneficient : an update for a very big amount of records is
> slow, and may fragment records for certain databases, or take temporarily
> twice the amount of space during the update (for postgres). What I mean
> there is just that I'm not sure updating all the records we've just
> inserted won't lead to performance problems during the update or after it.
> For postgresql, it will create double the records (temporarily).
> For mysql, if you use myisam, the update will (probably) be in place, but
> you'll lock the file table (and degrade any other job working). If you use
> innodb, I don't know how it will behave. Is it multiversion like postgres ?
> Or does it use a sort of undo space like oracle ? Both cases are
> problematic..
> Moreover, from the database point of view, it means you first
> insert 'incomplete' records in the database, then you 'tidy' them up. It
> seems a bit weird to me : either the record is in the file table and it's
> good, or it's not in the table at all...
>
> The advantage of dedicated tables is that they do not interfere with the
> current code, are very compact, and can be rearranged without ever locking
> the file table. And we keep 'display only' data out of the main tables. But
> it feels less 'pure' from a relational point of view.
>
> Of course, it's just my point of view :)
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc.
> Still grepping through log files to find problems?  Stop.
> Now Search log events and configuration files using AJAX and a browser.
> Download your FREE copy of Splunk now >>  http://get.splunk.com/
> _______________________________________________
> Bacula-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/bacula-devel

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >>  http://get.splunk.com/
_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel

Reply via email to