Richard Huxton wrote:
Madison Kelly wrote:

Hi all,

Is there a fast(er) way to get the sum of all integer values for a certain condition over many thousands of rows? What I am currently doing is this (which takes ~5-10sec.):

OK, I'm assuming you've configured PG to your satisfaction and this is the only query giving you problems.

This is a program for general consumption (hopefully... eventually...) so I want to leave the psql config alone. Once I am happier with the program I will try different tuning options and write a faq though I expect 9 out of 10 users won't read it.

SELECT SUM (a.file_size) FROM file_info_1 a, file_set_1 b WHERE a.file_name=b.fs_name AND a.file_parent_dir=b.fs_parent_dir AND a.file_type=b.fs_type AND b.fs_backup='t';

You'll want to run EXPLAIN ANALYSE SELECT SUM... and post the output of that, although the query looks straightforward enough.

tle-bu=> EXPLAIN ANALYZE SELECT SUM (a.file_size) FROM file_info_1 a, file_set_1 b WHERE a.file_name=b.fs_name AND a.file_parent_dir=b.fs_parent_dir AND a.file_type=b.fs_type AND b.fs_backup='t';

Aggregate (cost=2202.54..2202.54 rows=1 width=8) (actual time=5078.744..5078.748 rows=1 loops=1)
-> Merge Join (cost=724.94..2202.51 rows=11 width=8) (actual time=3281.677..4969.719 rows=12828 loops=1)
Merge Cond: (("outer".file_parent_dir = "inner".fs_parent_dir) AND ("outer".file_name = "inner".fs_name) AND ("outer".file_type = "inner".fs_type))
-> Index Scan using file_info_1_search_idx on file_info_1 a (cost=0.00..1317.11 rows=12828 width=104) (actual time=0.042..116.825 rows=12828 loops=1)
-> Sort (cost=724.94..740.97 rows=6414 width=96) (actual time=3281.516..3350.640 rows=12828 loops=1)
Sort Key: b.fs_parent_dir, b.fs_name, b.fs_type
-> Seq Scan on file_set_1 b (cost=0.00..319.35 rows=6414 width=96) (actual time=0.029..129.129 rows=12828 loops=1)
Filter: (fs_backup = true)
Total runtime: 5080.729 ms
(9 rows)

  Here are the schemas, in case they help:

tle-bu=> \d file_info_1 Table "public.file_info_1"
Column | Type | Modifiers
file_acc_time | bigint | not null
file_group_name | text | not null
file_group_uid | integer | not null
file_mod_time | bigint | not null
file_name | text | not null
file_parent_dir | text | not null
file_perm | text | not null
file_size | bigint | not null
file_type | text | not null default 'f'::text
file_user_name | text | not null
file_user_uid | integer | not null
"file_info_1_display_idx" btree (file_parent_dir, file_name, file_type)
"file_info_1_search_idx" btree (file_parent_dir, file_name, file_type)

tle-bu=> \d file_set_1               Table "public.file_set_1"
    Column     |  Type   |         Modifiers
 fs_backup     | boolean | not null default true
 fs_display    | boolean | not null default false
 fs_name       | text    | not null
 fs_parent_dir | text    | not null
 fs_restore    | boolean | not null default false
 fs_type       | text    | not null default 'f'::text
    "file_set_1_sync_idx" btree (fs_parent_dir, fs_name, fs_type)

2. Why do you have two identical indexes on file_info_1
4. Am I right in thinking that always, file_name==fs_name (i.e. they represent the same piece of information) and if so, why are you storing it twice? Same for _parent_dir too
5. file_type/fs_type are being held as unbounded text? Not an index into some lookup table or a varchar(N)?

Can you explain what you're trying to do here - it might be you want to alter your database design.
Richard Huxton
Archonet Ltd

This is where I have to admit my novice level of knowledge. Until now I have been more concerned with "making it work". It is only now that I have finished (more or less) the program that I have started going back and trying to find ways to speed it up. I have not used postgres (or perl or anything) before this program. I hope my questions aren't too basic. ^.^;

I keep hearing about Primary Keys but I can't say that I know what they are or how they are used. If I do understand, it is a way to reference another table's entry (using a foreign key)? The two matching indexes is a typo in my program that I hadn't noticed, I'll fix that asap.

  Here is what the database is used for:

This is a backup program and I use the DB to store extended information on all selected files and directories on a partition. Each partition has it's own 'file_info_#' and 'file_set_#' tables where '#' matches the ID stored for that partition in the DB in another table.

The 'file_info_#' table stored the data that can change such as file size, last modified/accessed, owing user and group and so forth. The 'file_set_#' table stores the flags that say to include or exclude it from a backup/restore job and whether it has been selected for display in the file browser.

In the first iteration I -used- to have the data in a single table and I identified the partition with a column called 'file_in_id' (or something similar). As I looked at each file on the system I would do a db call to see if the entry existed and if so, update it and if not, insert it. This was horribly slow though so I decided to break out into the schema above.

With the schema above what I do now is just drop the 'file_info_#' table, recreate the table and matching indexes and then do a mass 'COPY' of all the file info on the partition. After this is done I read in the new data from the reloaded 'file_info_#' table and sync the data in 'file_set_#' which removes entries no longer in 'file_info_#', adds new ones matching the parent's values and leaves the existing entries alone.

I found droping the table and re-creating it a lot faster than a 'DELETE FROM' call and it also seems to have made 'VACUUM FULL' a lot faster.

Thank you very much for your feedback! I hope I haven't done something -too- foolish. :p If I have, I will change it.


---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Reply via email to