Re: [Bacula-devel] Query changes in the catalog browser and indexes

Dan Langille Mon, 20 Aug 2007 05:22:50 -0700

On 19 Aug 2007 at 21:47, Dirk Bartley wrote:

> On Sun, 2007-08-19 at 13:06 -0400, Dan Langille wrote:
> > On 19 Aug 2007 at 12:35, Dirk Bartley wrote:
> >=20
> > > There is one alternative index to the above.  The index of the two
> > > colums (FilenameId, JobId) is really only needed for the condition of
> > > FilenameId =3D (SELECT FilenameId FROM Filename WHERE name=3D'').  A
> > > subquery can't be used when creating an index, but running one command
> > > and then substituting does work.  To create a partial index in postgres=
> :
> > > CREATE INDEX file_filenameid_jobid2 ON File (JobId) WHERE FilenameId=3D=
> 5;
> > > This could not be done until after a record exists in the filename tabl=
> e
> > > that fits the condition.
> >=20
> > The usefulness of that index can be determined by the queries that=20
> > use it.  I would expect that two indexes, one of filenameid and=20
> > another on jobid would work just as well.  But that too must be=20
> > tested.  On PostgreSQL, the "explain analyse" output will show what=20
> > is best.
> >=20
> > I can help with the query tuning on PostgreSQL.
> 
> I use postgresql for my work server and the last week of testing has
> been my experience using EXPLAIN in psql.  I've used EXPLAIN abaout 60
> times or so attempting to figure out optimizing indexes.  Any testing
> and explantion of an optimization above what I have done would certainly
> help me and educate me.
> 
> I may be wrong but I thoght I was suppsed to do
> 
> ANALYZE file;
> 
> EXPLAIN .. query ..;
> 
> After adding an index.    Is that what I'm supposed to do according to
> your recolection.  I kept using "explain" to determine if it would do a
> sequential scan or use the indexes as hoped.


I don't think you need to.  Try reading this:

    http://www.freebsddiary.org/postgresql-analyze.php

using 'explain analyse' will execute the query, giving you the actual 
time, and tell you the query plan.  Note the "execute the query" bit. 
That's probably not an issue if you're doing a SELECT, but be careful 
with DELETE/UPDATE.

> 
> 
> 
> There are basically three queries.  One to get a list of directories,
> another to get a list of files and one to get a list of versions for a
> specific file.
> 
> 1. To get a list of directories given a list of jobs.  Here is an
> example: =20
> 
> SELECT DISTINCT Path.Path AS Path, File.PathId AS PathId FROM File INNER
> JOIN Path ON (File.PathId=3DPath.PathId) WHERE File.FilenameId=3D3 AND
> File.Jobid IN
> (441,440,439,438,437,436,435,434,433,432,431,430,429,428,427,426,425,424,42=
> 3,422,421,420,419,418,416,415,413,412,411,410,409,408,407,406,405,404,403,4=
> 02,400,399,398,397,396,395,394,393,392,391,385,384,383,382,381,379,378,376,=
> 375,374,373,372,371) ORDER BY Path
> Time: 3988.458 ms

That is four seconds.  I'd have to see the explain analyse output.

One this to know: a recent 'vacuum analyse' ensures that the query 
planner has the latest stats available.

An aside, and not suggesting this be done now: assuming that the 
JobIds are sequential, it might be useful to develop some code that 
changes that to:

  WHERE File.JobId BETWEEN 371 AND 441

The IN operator is pretty good now. In previous versions there was a 
slight penalty hit.

> Query the file table for all records that have a filenameid where
> name=3D'', the filenameid value is queried for earlier.  If the filenameid
> where filename=3D'' is retrieved as a subquery, the analyze does not do as
> well, especially with a partial index.  Therfore I have chosen an index
> which is either on (filenameid and jobid) or (a partial index on jobid
> where filenameid=3D3).  Either one of these seems to perform considerably
> better than separate indexe on filenameid and jobid.

A partial index is good, but not much use if we cannot predict what 
we need to partially index upon.

> I believe one of these performs better due to the file table containing
> a very large number of records.
> bacula=3D# SELECT COUNT(Fileid) FROM File;
>   count
> ----------
>  14049283
> (1 row)
> bacula=3D# SELECT COUNT(Fileid) FROM File Where FilenameId=3D5;
>   count
> ---------
>  1834962
> (1 row)
> The analyze shows the query geting a list of the directories for a
> certain specified list of jobs, we are only interested in the records
> where pathid=3D5 (or the pathid for pathname=3D'').  Then if this set of
> records is indexed on the jobs, the amount of time the query takes is
> mostly determined by the number of
> 
> So out of 14 million records in the file table, 1.8 million are for
> directories.  The record count to process is reduced in the query from
> 14 million to 1.8 million by using a multi columned index.
> 
> If both indexes are available to explain, it will use the partial index.


> 2. get a list of filenames in a certain path for a list of jobs.
> Example:
> 
> SELECT DISTINCT Filename.Name AS FileName, Filename.FilenameId AS
> FilenameId FROM File  INNER JOIN Filename on
> (Filename.FilenameId=3DFile.FilenameId) WHERE File.PathId=3D27604 AND
> File.Jobid IN
> (2598,2584,2570,2556,2537,2519,2503,2488,2473,2458,2440,2425,2410,2395,2380=
> ,2363,2348,2333,2318,2303,2286,2271) AND Filename.Name!=3D'' ORDER BY FileN=
> ame
> 
> This one is easy.  EXPLAIN says it will use the index in Pathid and
> performs it in Time: 4.383 ms.  This index does not exist in the install
> script.

That is a good performance time.

> 
> 3. get a list of versions of a file.  Example:
> 
> SELECT Job.JobId AS JobId, Job.Level AS Type, Job.EndTime AS EndTime,
> File.Md5 AS MD5, File.FileId AS FileId FROM File INNER JOIN Filename on
> (Filename.FilenameId=3DFile.FilenameId) INNER JOIN Path ON
> (Path.PathId=3DFile.PathId) INNER JOIN Job ON (File.JobId=3DJob.JobId) WHER=
> E
> Path.PathId=3D27604 AND Filename.FilenameId=3D1049929 AND Job.Jobid IN
> (2598,2584,2570,2556,2537,2519,2503,2488,2473,2458,2440,2425,2410,2395,2380=
> ,2363,2348,2333,2318,2303,2286,2271) ORDER BY Job.EndTime DESC
> 
> This one is also easy, EXPLAIN says it will use the index on
> (filenameid, jobid), which already exists in the install script.
> Time: 42.523 ms

That's a good time.  Now, do we need that index for *anything* else?  
If we remove it, what is the result?  Sometimes fewer indexes is a 
good thing.

-- 
Dan Langille - http://www.langille.org/
Available for hire: http://www.freebsddiary.org/dan_langille.php



-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >>  http://get.splunk.com/
_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel

Re: [Bacula-devel] Query changes in the catalog browser and indexes

Reply via email to