Hello,

I have 3 tables (2 tables + 1 lookup table that ties them), and running
a straightforward aggregate count(*) query with a couple of joins
takes about 10 seconds (I need it to be sub-second or so).
I am also wondering whether this approach will scale with my row
counts and my hardware (details below).

My slow query is this:
----------------------
SELECT keyword.name, count(*)
FROM user_data, user_data_keyword, keyword
WHERE (user_data.user_id = 1)
      AND (user_data.id = user_data_keyword.user_data_id)
      AND (user_data_keyword.keyword_id = keyword.id)
      GROUP BY keyword.name
      ORDER BY COUNT(*) DESC LIMIT 10;


QUERY PLAN from EXPLAIN ANALYZE:
--------------------------------
 Limit  (cost=27820.00..27820.03 rows=10 width=114) (actual time=9971.322..9971.401 rows=10 loops=1)
   ->  Sort  (cost=27820.00..27822.50 rows=1000 width=114) (actual time=9971.312..9971.338 rows=10 loops=1)
         Sort Key: count(*)
         ->  HashAggregate  (cost=27767.67..27770.17 rows=1000 width=114) (actual time=9955.457..9963.051 rows=2005 loops=1)
               ->  Hash Join  (cost=4459.64..27738.80 rows=5774 width=114) (actual time=1140.776..9919.852 rows=5516 loops=1)
                     Hash Cond: ("outer".keyword_id = "inner".id)
                     ->  Hash Join  (cost=4437.14..27600.81 rows=5774 width=4) (actual time=21.781..7804.329 rows=5516 loops=1)
                           Hash Cond: ("outer".user_data_id = "inner".id)
                           ->  Seq Scan on user_data_keyword  (cost=0.00..17332.29 rows=1154729 width=8) (actual time=2.717..3834.186 rows=1154729 loops=1)
                           ->  Hash  (cost=4433.94..4433.94 rows=1278 width=4) (actual time=18.862..18.862 rows=0 loops=1)
                                 ->  Index Scan using ix_user_data_user_id_data_id on user_data  (cost=0.00..4433.94 rows=1278 width=4) (actual time=0.234..13.454 rows=1149 loops=1)
                                       Index Cond: (user_id = 1)
                     ->  Hash  (cost=20.00..20.00 rows=1000 width=118) (actual time=1118.616..1118.616 rows=0 loops=1)
                           ->  Seq Scan on keyword  (cost=0.00..20.00 rows=1000 width=118) (actual time=1.140..609.577 rows=105440 loops=1)
 Total runtime: 9972.704 ms
(15 rows)


Ouch :)

I'm trying to analyze the query plan (I'm not very good at it,
obviously), and I see 2 Sequential Scans: one on the _big_
"user_data_keyword" table, accounting for about 60% of the total cost,
and one on the "keyword" table.
I also see a HashAggregate with a high cost and a long actual time.
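
One thing that jumps out even to me is that the planner's estimate for
the "keyword" seq scan (rows=1000, cost=20.00) is nowhere near the
actual 105440 rows, so maybe my statistics are simply stale.  I was
going to try refreshing them and re-running EXPLAIN ANALYZE, along
these lines (just a guess on my part, not something I know will fix it):

  -- guessing stale planner statistics, given rows=1000 estimated vs rows=105440 actual
  VACUUM ANALYZE keyword;
  VACUUM ANALYZE user_data;
  VACUUM ANALYZE user_data_keyword;

but I don't know whether that alone will get me anywhere near sub-second.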

I'm not sure what to do or which indexes I need to add, as the
"user_data_keyword" and "keyword" tables already have PK-based btree
indexes:
  "user_data_keyword" has: ... btree (user_data_id, keyword_id)
  "keyword"           has: ... btree (id)


Here are my 3 tables:

user_data (about 300K rows currently, will grow to 10M+)
---------
  id           (PK),
  user_id      (FK)
  ... other columns ...

user_data_keyword (lookup table - size always 4 x user_data)
-----------------
  user_data_id (FK)
  keyword_id   (FK)
  PK(user_data_id, keyword_id)

keyword (size about 10 x smaller than user_data_keyword)
-------
  id           (PK)
  name         VARCHAR(64) NOT NULL UNIQUE
  add_date     TIMESTAMP
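
In SQL terms the tables look roughly like this (a simplified sketch
from memory; the ids are plain integer sequences and I've left out the
other user_data columns):

  CREATE TABLE user_data (
      id       SERIAL PRIMARY KEY,
      user_id  INTEGER NOT NULL   -- FK to my users table
      -- ... other columns ...
  );

  CREATE TABLE keyword (
      id       SERIAL PRIMARY KEY,
      name     VARCHAR(64) NOT NULL UNIQUE,
      add_date TIMESTAMP
  );

  CREATE TABLE user_data_keyword (
      user_data_id INTEGER NOT NULL REFERENCES user_data(id),
      keyword_id   INTEGER NOT NULL REFERENCES keyword(id),
      PRIMARY KEY (user_data_id, keyword_id)
  );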



Is there any way of speeding up my query?

Also, given the number of rows expected in those tables:
user_data:         10M
user_data_keyword: 40M
keyword:            4M

Any ideas how a query like this will scale when I hit those numbers?

These particular query timings are from PostgreSQL 7.4.6 on a 1.5GHz laptop,
but the production server is running PG 8.0.3 on a 3.0GHz P4 with 2
SATA disks in a RAID1 config and 1GB RAM.  How realistic is it to get
sub-second responses on such hardware given the above numbers?

Thanks,
Otis

