Re: [Bacula-devel] Query changes in the catalog browser and indexes

Cousin Marc Mon, 20 Aug 2007 07:01:49 -0700

Hi, I'm interfering in the discussion as I've worked with Eric on brestore 
(doing the SQL part), doing something looking a lot like what you're doing 
now. My comments are inline

Le Monday 20 August 2007 03:47:15 Dirk Bartley, vous avez écrit :
> On Sun, 2007-08-19 at 13:06 -0400, Dan Langille wrote:
> > On 19 Aug 2007 at 12:35, Dirk Bartley wrote:
> > > There is one alternative index to the above.  The index of the two
> > > colums (FilenameId, JobId) is really only needed for the condition of
> > > FilenameId = (SELECT FilenameId FROM Filename WHERE name='').  A
> > > subquery can't be used when creating an index, but running one command
> > > and then substituting does work.  To create a partial index in
> > > postgres: CREATE INDEX file_filenameid_jobid2 ON File (JobId) WHERE
> > > FilenameId=5; This could not be done until after a record exists in the
> > > filename table that fits the condition.

Ok, but the FilenameId will be different from database to database ... so 
watchout for the SQL creation script.

> >
> > The usefulness of that index can be determined by the queries that
> > use it.  I would expect that two indexes, one of filenameid and
> > another on jobid would work just as well.  But that too must be
> > tested.  On PostgreSQL, the "explain analyse" output will show what
> > is best.

This will work correctly on Postgres, because it will do a bitmap index scan 
if it sees fit. It won't work with mysql though, as it can only use one index 
for a scan (if there's a mysql guru around wanting to correct me on this go 
ahead, I'm not sure of it :) )
It won't be as efficient as a two-column index or a partial index, anyhow.

> I use postgresql for my work server and the last week of testing has
> been my experience using EXPLAIN in psql.  I've used EXPLAIN abaout 60
> times or so attempting to figure out optimizing indexes.  Any testing
> and explantion of an optimization above what I have done would certainly
> help me and educate me.
>
> I may be wrong but I thoght I was suppsed to do
>
> ANALYZE file;

Analyze file builds statistics on the table (cardinality of the table, % of 
different values of each column, most common values, etc...)
Explain analyze does the same as explain, but then runs the query, does some 
measurements, and enables you to do a reality check : is the estimated plan 
correct compared to reality (estimates of the returned rows, mainly), and the 
execution time of the query and its different steps.

>
> EXPLAIN .. query ..;
>
> After adding an index.    Is that what I'm supposed to do according to
> your recolection.  I kept using "explain" to determine if it would do a
> sequential scan or use the indexes as hoped.
>
>
>
> There are basically three queries.  One to get a list of directories,
> another to get a list of files and one to get a list of versions for a
> specific file.

>
> 1. To get a list of directories given a list of jobs.  Here is an
> example:
>
> SELECT DISTINCT Path.Path AS Path, File.PathId AS PathId FROM File INNER
> JOIN Path ON (File.PathId=Path.PathId) WHERE File.FilenameId=3 AND
> File.Jobid IN
> (441,440,439,438,437,436,435,434,433,432,431,430,429,428,427,426,425,424,42
>3,422,421,420,419,418,416,415,413,412,411,410,409,408,407,406,405,404,403,40
>2,400,399,398,397,396,395,394,393,392,391,385,384,383,382,381,379,378,376,37
>5,374,373,372,371) ORDER BY Path Time: 3988.458 ms

So you retrieve all the paths in one go ? What happens if you want to change 
the jobid ? Do You recalculate everything ?

>
> Query the file table for all records that have a filenameid where
> name='', the filenameid value is queried for earlier.  If the filenameid
> where filename='' is retrieved as a subquery, the analyze does not do as
> well, especially with a partial index.  Therfore I have chosen an index
> which is either on (filenameid and jobid) or (a partial index on jobid
> where filenameid=3).  Either one of these seems to perform considerably
> better than separate indexe on filenameid and jobid.

The first one will be easier to maintain, because the 'empty' filename 
filenameid may not be the same from database to database

>
> I believe one of these performs better due to the file table containing
> a very large number of records.
> bacula=# SELECT COUNT(Fileid) FROM File;
>   count
> ----------
>  14049283
> (1 row)
> bacula=# SELECT COUNT(Fileid) FROM File Where FilenameId=5;
>   count
> ---------
>  1834962
> (1 row)
> The analyze shows the query geting a list of the directories for a
> certain specified list of jobs, we are only interested in the records
> where pathid=5 (or the pathid for pathname='').  Then if this set of
> records is indexed on the jobs, the amount of time the query takes is
> mostly determined by the number of
>
> So out of 14 million records in the file table, 1.8 million are for
> directories.  The record count to process is reduced in the query from
> 14 million to 1.8 million by using a multi columned index.
>
> If both indexes are available to explain, it will use the partial index.

It's smaller, so that's normal as the costs are lower...

>
>
>
> 2. get a list of filenames in a certain path for a list of jobs.
> Example:
>
> SELECT DISTINCT Filename.Name AS FileName, Filename.FilenameId AS
> FilenameId FROM File  INNER JOIN Filename on
> (Filename.FilenameId=File.FilenameId) WHERE File.PathId=27604 AND
> File.Jobid IN
> (2598,2584,2570,2556,2537,2519,2503,2488,2473,2458,2440,2425,2410,2395,2380
>,2363,2348,2333,2318,2303,2286,2271) AND Filename.Name!='' ORDER BY FileName
>
> This one is easy.  EXPLAIN says it will use the index in Pathid and
> performs it in Time: 4.383 ms.  This index does not exist in the install
> script.

Yes, no problem here as the criterion on PathId is restrictive enough

>
> 3. get a list of versions of a file.  Example:
>
> SELECT Job.JobId AS JobId, Job.Level AS Type, Job.EndTime AS EndTime,
> File.Md5 AS MD5, File.FileId AS FileId FROM File INNER JOIN Filename on
> (Filename.FilenameId=File.FilenameId) INNER JOIN Path ON
> (Path.PathId=File.PathId) INNER JOIN Job ON (File.JobId=Job.JobId) WHERE
> Path.PathId=27604 AND Filename.FilenameId=1049929 AND Job.Jobid IN
> (2598,2584,2570,2556,2537,2519,2503,2488,2473,2458,2440,2425,2410,2395,2380
>,2363,2348,2333,2318,2303,2286,2271) ORDER BY Job.EndTime DESC
>
> This one is also easy, EXPLAIN says it will use the index on
> (filenameid, jobid), which already exists in the install script.
> Time: 42.523 ms
>
> So if anyone has any other thoughts about indexes, I sure would be open
> to looking them.

I haven't, except for my interrogation on the way you're dealing with paths : 
retrieving all the paths in one go seems very dangerous to me (we've been 
down this road with brestore).

For instance, here, we have a server with one million directories, with a full 
backup every week. It means retrieving all this just to be able to display 
the root tree of the server. It means retrieving a million records doing 
index scans on the file table, which will be a prohibitive waiting time for 
the user, and a very big waste of memory and network bandwidth. For this 
server, with the first try at brestore, there was nearly 200Mb of ram wasted 
just for this.

The way we solved it with brestore implied creating new (very compact, 2 
bigint columns) tables to store directories dependancies (which directories 
are stored in which one), and for which jobid a directory is visible. We 
didn't like it, but didn't see any other way of doing it efficiently: doing a 
query on Path to get all subdirectories of a directory was fast only for leaf 
records (you have to do a like 'directory%' where clause...), and of course a 
user browses from the root directory of a tree...

The benefit of brestore's way of doing it is that only the first user needing 
the information has to compute it... or you can even pre-compute it early in 
the morning, after your backups are done. Of course, the problem is that the 
code is a bit complex for this part. But it simplified all the rest (browsing 
the tree) a great deal.

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >>  http://get.splunk.com/
_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel

Re: [Bacula-devel] Query changes in the catalog browser and indexes

Reply via email to