Re: [Bacula-devel] Query changes in the catalog browser and indexes

Dirk Bartley Wed, 22 Aug 2007 19:02:18 -0700

On Wed, 2007-08-22 at 10:11 -0400, Dan Langille wrote:
> On 21 Aug 2007 at 20:07, Dirk Bartley wrote:
> 
> > On Mon, 2007-08-20 at 08:04 -0400, Dan Langille wrote:
> > > > 
> > > > There are basically three queries.  One to get a list of directories,
> > > > another to get a list of files and one to get a list of versions for a
> > > > specific file.
> > > > 
> > > > 1. To get a list of directories given a list of jobs.  Here is an
> > > > example: =20
> > > > 
> > > > SELECT DISTINCT Path.Path AS Path, File.PathId AS PathId FROM File INNER
> > > > JOIN Path ON (File.PathId=3DPath.PathId) WHERE File.FilenameId=3D3 AND
> > > > File.Jobid IN
> > > > (441,440,439,438,437,436,435,434,433,432,431,430,429,428,427,426,425,424,42=
> > > > 3,422,421,420,419,418,416,415,413,412,411,410,409,408,407,406,405,404,403,4=
> > > > 02,400,399,398,397,396,395,394,393,392,391,385,384,383,382,381,379,378,376,=
> > > > 375,374,373,372,371) ORDER BY Path
> > > > Time: 3988.458 ms
> > > 
> > > That is four seconds.  I'd have to see the explain analyse output.
> > 
> >  Unique  (cost=338755.18..340527.31 rows=6297 width=36) (actual
> > time=4000.332..4000.437 rows=16 loops=1)
> >    ->  Sort  (cost=338755.18..339345.89 rows=236283 width=36) (actual
> > time=4000.330..4000.359 rows=204 loops=1)
> >          Sort Key: path.path, file.pathid
> >          ->  Hash Join  (cost=16509.58..307281.74 rows=236283 width=36)
> > (actual time=3932.884..3994.564 rows=204 loops=1)
> >                Hash Cond: (file.pathid = path.pathid)
> >                ->  Index Scan using file_filenameid_jobid2 on file
> > (cost=0.00..280519.50 rows=236283 width=4) (actual
> > time=130.950..3289.826 rows=204 loops=1)
> >                      Filter: (jobid = ANY
> > ('{2629,2616,2598,2584,2570,2556,2537,2519,2503,2488,2473,2458,2440,2425,2410,2395,2380,2363,2348,2333,2318,2303}'::integer[]))
> >                ->  Hash  (cost=8713.59..8713.59 rows=366159 width=36)
> > (actual time=453.745..453.745 rows=366951 loops=1)
> >                      ->  Seq Scan on path  (cost=0.00..8713.59
> > rows=366159 width=36) (actual time=0.021..179.985 rows=366951 loops=1)
> >  Total runtime: 4000.694 ms
> > (10 rows)
> > 
> > This shows the database choosing to use the partial index when both
> > exist.  My impression of the partial indexex advantages is that it is
> > less expensive to maintain.  Maintenence of the index will only be
> > triggered when a record in file is modified that contains the
> > filenameid.  The disadvantage is that it could not be created until
> > after the first job is done and the filenameid where name='' can be
> > queried for.
> 
> Partial indexes are usually only useful if what you are looking for 
> is a constant.


My hope and current understanding is that the filenameid which is
returned from the query
SELECT FilenameId FROM Filename WHERE name=''
is "relatively" constant and will not change until the last job that has
a directory is purged from the database.

> I think you are proposing creating an index on-the-
> fly.

On the fly is a term that I would use if the index were being re-created
on a regular basis, this index is not.

>   I think this is not the way to go.  I think we can/should find 
> another way to improve this query's performance.

If there is another way, that would be most excellent.  All ways  that I
can imagine involve changes in the schema, like if there was a client
column in the file table, an index could be put on that and the
interface could be modified to not worry about jobs, only the client.

> 
> If you can put your database on a website somewhere, I'll download it 
> and try it out and work on the query that way.  

I will respectfully choose not to do that at this time.  I fear there
may be reprisals from management if it were ever found out that I
allowed corporate data which describes our entire filestructure
available.  Not to mention using a gigs worth of our companies bandwidth
and that we have no web server to place the file on.

> With this, we'll both 
> be working with the same dataset, and I can show you how to make 
> things faster.

I think we can do this with separate data as long as you have a
reasonably large enough data set to be comparable.  

SELECT DISTINCT Path.Path AS Path, File.PathId AS PathId FROM File INNER
JOIN Path ON (File.PathId=Path.PathId) WHERE File.FilenameId=YVAL AND
File.Jobid IN (XLIST) ORDER BY Path

just use the above and replace YVAL with the result of:
SELECT FilenameId FROM Filename WHERE name=''

and replace XLIST with the result of:
SELECT Job.Jobid AS Id FROM Job INNER JOIN Client ON
(Job.ClientId=Client.ClientId) INNER JOIN FileSet ON
(Job.FileSetId=FileSet.FileSetId) WHERE Client.Name='srvalum1-fd' ORDER
BY Job.EndTime DESC LIMIT 20

of course replace srvalum1-fd with your favorite client.

Thank you very much for your willingness to take a look at the indexes I
am proposing.

Dirk





-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >>  http://get.splunk.com/
_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel

Re: [Bacula-devel] Query changes in the catalog browser and indexes

Reply via email to