On Mon, 2007-08-20 at 08:04 -0400, Dan Langille wrote:
> >
> > There are basically three queries. One to get a list of directories,
> > another to get a list of files and one to get a list of versions for a
> > specific file.
> >
> > 1. To get a list of directories given a list of jobs. Here is an
> > example: =20
> >
> > SELECT DISTINCT Path.Path AS Path, File.PathId AS PathId FROM File INNER
> > JOIN Path ON (File.PathId=3DPath.PathId) WHERE File.FilenameId=3D3 AND
> > File.Jobid IN
> > (441,440,439,438,437,436,435,434,433,432,431,430,429,428,427,426,425,424,42=
> > 3,422,421,420,419,418,416,415,413,412,411,410,409,408,407,406,405,404,403,4=
> > 02,400,399,398,397,396,395,394,393,392,391,385,384,383,382,381,379,378,376,=
> > 375,374,373,372,371) ORDER BY Path
> > Time: 3988.458 ms
>
> That is four seconds. I'd have to see the explain analyse output.
Unique (cost=338755.18..340527.31 rows=6297 width=36) (actual
time=4000.332..4000.437 rows=16 loops=1)
-> Sort (cost=338755.18..339345.89 rows=236283 width=36) (actual
time=4000.330..4000.359 rows=204 loops=1)
Sort Key: path.path, file.pathid
-> Hash Join (cost=16509.58..307281.74 rows=236283 width=36)
(actual time=3932.884..3994.564 rows=204 loops=1)
Hash Cond: (file.pathid = path.pathid)
-> Index Scan using file_filenameid_jobid2 on file
(cost=0.00..280519.50 rows=236283 width=4) (actual
time=130.950..3289.826 rows=204 loops=1)
Filter: (jobid = ANY
('{2629,2616,2598,2584,2570,2556,2537,2519,2503,2488,2473,2458,2440,2425,2410,2395,2380,2363,2348,2333,2318,2303}'::integer[]))
-> Hash (cost=8713.59..8713.59 rows=366159 width=36)
(actual time=453.745..453.745 rows=366951 loops=1)
-> Seq Scan on path (cost=0.00..8713.59
rows=366159 width=36) (actual time=0.021..179.985 rows=366951 loops=1)
Total runtime: 4000.694 ms
(10 rows)
This shows the database choosing to use the partial index when both
exist. My impression of the partial indexex advantages is that it is
less expensive to maintain. Maintenence of the index will only be
triggered when a record in file is modified that contains the
filenameid. The disadvantage is that it could not be created until
after the first job is done and the filenameid where name='' can be
queried for.
>
> One this to know: a recent 'vacuum analyse' ensures that the query
> planner has the latest stats available.
>
> An aside, and not suggesting this be done now: assuming that the
> JobIds are sequential, it might be useful to develop some code that
> changes that to:
>
> WHERE File.JobId BETWEEN 371 AND 441
Unfortunately, this is usually not the case.
>
> The IN operator is pretty good now. In previous versions there was a
> slight penalty hit.
I think the current "penalty" comes with the large number of jobs that
can be in the list.
>
> > Query the file table for all records that have a filenameid where
> > name=3D'', the filenameid value is queried for earlier. If the filenameid
> > where filename=3D'' is retrieved as a subquery, the analyze does not do as
> > well, especially with a partial index. Therfore I have chosen an index
> > which is either on (filenameid and jobid) or (a partial index on jobid
> > where filenameid=3D3). Either one of these seems to perform considerably
> > better than separate indexe on filenameid and jobid.
>
> A partial index is good, but not much use if we cannot predict what
> we need to partially index upon.
An explanation of how to create it could be put in the help. It would
also be possible to create a script to create it to be run after at
least one job has been run. I could even create code in bat to add the
partial index. but then it would have to have at least one option for
each database possible.
>
> > I believe one of these performs better due to the file table containing
> > a very large number of records.
> > bacula=3D# SELECT COUNT(Fileid) FROM File;
> > count
> > ----------
> > 14049283
> > (1 row)
> > bacula=3D# SELECT COUNT(Fileid) FROM File Where FilenameId=3D5;
> > count
> > ---------
> > 1834962
> > (1 row)
> > The analyze shows the query geting a list of the directories for a
> > certain specified list of jobs, we are only interested in the records
> > where pathid=3D5 (or the pathid for pathname=3D''). Then if this set of
> > records is indexed on the jobs, the amount of time the query takes is
> > mostly determined by the number of
> >
> > So out of 14 million records in the file table, 1.8 million are for
> > directories. The record count to process is reduced in the query from
> > 14 million to 1.8 million by using a multi columned index.
> >
> > If both indexes are available to explain, it will use the partial index.
>
>
> > 2. get a list of filenames in a certain path for a list of jobs.
> > Example:
> >
> > SELECT DISTINCT Filename.Name AS FileName, Filename.FilenameId AS
> > FilenameId FROM File INNER JOIN Filename on
> > (Filename.FilenameId=3DFile.FilenameId) WHERE File.PathId=3D27604 AND
> > File.Jobid IN
> > (2598,2584,2570,2556,2537,2519,2503,2488,2473,2458,2440,2425,2410,2395,2380=
> > ,2363,2348,2333,2318,2303,2286,2271) AND Filename.Name!=3D'' ORDER BY FileN=
> > ame
> >
> > This one is easy. EXPLAIN says it will use the index in Pathid and
> > performs it in Time: 4.383 ms. This index does not exist in the install
> > script.
>
> That is a good performance time.
The question below about the index being used for anything else is
probably more suited for this index since it is a new one. The answer
is, to my knowledge it won't be used anywhere else. The index on pathid
is the perfect one because pathid is the known value and it turns a very
long query into a very short one. After dropping this index, a runtime
test says it took 22589.825 ms instead of 4 ms. explain analyze said it
would use file_fp_idx which is the multi column index on filenameid and
pathid that already exists.
Dirk
>
> >
> > 3. get a list of versions of a file. Example:
> >
> > SELECT Job.JobId AS JobId, Job.Level AS Type, Job.EndTime AS EndTime,
> > File.Md5 AS MD5, File.FileId AS FileId FROM File INNER JOIN Filename on
> > (Filename.FilenameId=3DFile.FilenameId) INNER JOIN Path ON
> > (Path.PathId=3DFile.PathId) INNER JOIN Job ON (File.JobId=3DJob.JobId) WHER=
> > E
> > Path.PathId=3D27604 AND Filename.FilenameId=3D1049929 AND Job.Jobid IN
> > (2598,2584,2570,2556,2537,2519,2503,2488,2473,2458,2440,2425,2410,2395,2380=
> > ,2363,2348,2333,2318,2303,2286,2271) ORDER BY Job.EndTime DESC
> >
> > This one is also easy, EXPLAIN says it will use the index on
> > (filenameid, jobid), which already exists in the install script.
> > Time: 42.523 ms
>
> That's a good time. Now, do we need that index for *anything* else?
> If we remove it, what is the result? Sometimes fewer indexes is a
> good thing.
>
-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel