Re: [PERFORM] fast DISTINCT or EXIST

Arjen van der Meijden Sat, 07 Apr 2007 05:33:51 -0700

Can't you use something like this? Or is the distinct on the t.cd_idstill causing the major slowdown here?


SELECT ... FROM cd
  JOIN tracks ...
WHERE cd.id IN (SELECT DISTINCT t.cd_id FROM tracks t
     WHERE t.tstitle @@ plainto_tsquery('simple','education') LIMIT 10)

If that is your main culprit, you could also use two limits based on thefact that there will be at most X songs per cd which would match yourtitle (my not very educated guess is 3x). Its a bit ugly... but if thatis what it takes to make postgresql not scan your entire index, so be it...


SELECT ... FROM cd
  JOIN tracks ...
WHERE cd.id IN (SELECT DISTINCT cd_id FROM (SELECT t.cd_id FROM tracks t

WHERE t.tstitle @@ plainto_tsquery('simple','education') LIMIT 30)as foo LIMIT 10)



Best regards,

Arjen

On 7-4-2007 12:47 Tilo Buschmann wrote:

Hello,

I am trying to build a application to search CDs and their tracks and I
am experiencing some performance difficulties.

The database is very simple at the moment, two tables "cd" and "tracks"
contain the CD-information and their respective tracks. A column
"cd_id" in public.tracks is the foreign key to the cd table.

#v+
                                  Table "public.cd"
   Column    |       Type        |                     Modifiers
-------------+-------------------+----------------------------------------------------
 revision    | integer           | not null default 0
 disc_length | integer           |
 via         | character varying |
 cd_id       | integer           | not null default 
nextval('cd_cd_id_seq'::regclass)
 discid      | integer           | not null
 title       | character varying | not null
 artist      | character varying | not null
 year        | smallint          |
 genre       | character varying |
 ext         | character varying |
 tstitle     | tsvector          |
 tsartist    | tsvector          |
Indexes:
    "cd_id_key" PRIMARY KEY, btree (cd_id)
    "discid_key" UNIQUE, btree (discid)
    "tsartist_cd_idx" gist (tsartist)
    "tstitle_cd_idx" gist (tstitle)
Check constraints:
    "year_check" CHECK ("year" IS NULL OR "year" >= 0 AND "year" <= 10000)
Tablespace: "d_separate"

                                   Table "public.tracks"

Column | Type | Modifiers----------+-------------------+-----------------------------------------------------------

 track_id | integer           | not null default 
nextval('tracks_track_id_seq'::regclass)
 cd_id    | integer           | not null

    "tracks_pkey" PRIMARY KEY, btree (track_id)
    "cdid_tracks_idx" btree (cd_id)
    "tsartist_tracks_idx" gist (tsartist)
    "tstitle_tracks_idx" gin (tstitle)
Foreign-key constraints:
    "tracks_cd_id_fkey" FOREIGN KEY (cd_id) REFERENCES cd(cd_id) ON UPDATE 
RESTRICT ON DELETE RESTRICT
Tablespace: "d_separate"

#v-

I am using tsearch2 to be able to search very fast for CD and track
artists and titles.

The database is created only once and I expect SELECTS to happen very
often, therefore the indexes will not hurt the performance. I also ran
a VACUUM FULL ANALYSE.

The query that I want to optimise at the moment is the "Give me all CDs
with their tracks, that contain a track with the Title 'foobar'". The
query is very expensive, so I try to limit it to 10 cds at once.

My first idea was:

#+
cddb=# EXPLAIN ANALYSE SELECT cd.cd_id,cd.title,cd.artist,tracks.title FROM 
tracks JOIN (SELECT cd.cd_id,cd.artist,cd.title FROM cd JOIN tracks USING 
(cd_id) WHERE tracks.tstitle @@ plainto_tsquery('simple','education') LIMIT 10) 
AS cd USING (cd_id);

QUERY PLAN----------------------------------------------------------------------------------------------------------------------------------------------------------

 Nested Loop  (cost=0.00..3852.42 rows=11974 width=91) (actual 
time=310.983..972.739 rows=136 loops=1)
   ->  Limit  (cost=0.00..121.94 rows=10 width=46) (actual 
time=264.797..650.178 rows=10 loops=1)
         ->  Nested Loop  (cost=0.00..227602.43 rows=18665 width=46) (actual 
time=264.793..650.165 rows=10 loops=1)
               ->  Index Scan using tstitle_tracks_idx on tracks  
(cost=0.00..73402.74 rows=18665 width=4) (actual time=155.516..155.578 rows=10 
loops=1)
                     Index Cond: (tstitle @@ '''education'''::tsquery)
               ->  Index Scan using cd_id_key on cd  (cost=0.00..8.25 rows=1 
width=46) (actual time=49.452..49.453 rows=1 loops=10)
                     Index Cond: (public.cd.cd_id = public.tracks.cd_id)
   ->  Index Scan using cdid_tracks_idx on tracks  (cost=0.00..358.08 rows=1197 
width=27) (actual time=29.588..32.239 rows=14 loops=10)
         Index Cond: (public.tracks.cd_id = cd.cd_id)
 Total runtime: 972.917 ms
(10 rows)
#v-


The query is fast enough, but erroneous. If a cd contains more than one
track, that matches the condition, the inner SELECT will return more
than one cd and therefore the whole query will shield duplicate cds.

The solution is to either insert DISTINCT into the above query or use
EXISTS as condition, but both queries show a terrible performance:

#v+
cddb=# EXPLAIN ANALYSE SELECT cd.cd_id,cd.title,cd.artist,tracks.title FROM 
tracks JOIN (SELECT DISTINCT cd.cd_id,cd.artist,cd.title FROM cd JOIN tracks 
USING (cd_id) WHERE tracks.tstitle @@ plainto_tsquery('simple','education') 
LIMIT 10) AS cd USING (cd_id);
                                                                           
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=152390.12..156120.71 rows=11974 width=91) (actual 
time=37356.517..37605.073 rows=137 loops=1)
   ->  Limit  (cost=152390.12..152390.22 rows=10 width=46) (actual 
time=37289.598..37289.643 rows=10 loops=1)
         ->  Unique  (cost=152390.12..152576.77 rows=18665 width=46) (actual 
time=37289.594..37289.629 rows=10 loops=1)
               ->  Sort  (cost=152390.12..152436.79 rows=18665 width=46) 
(actual time=37289.590..37289.601 rows=12 loops=1)
                     Sort Key: public.cd.cd_id, public.cd.artist, 
public.cd.title
                     ->  Hash Join  (cost=78926.50..151066.02 rows=18665 
width=46) (actual time=36214.504..37285.974 rows=811 loops=1)
                           Hash Cond: (public.tracks.cd_id = public.cd.cd_id)
                           ->  Bitmap Heap Scan on tracks  
(cost=536.76..59707.31 rows=18665 width=4) (actual time=0.724..39.253 rows=811 
loops=1)
                                 Recheck Cond: (tstitle @@ 
'''education'''::tsquery)
                                 ->  Bitmap Index Scan on tstitle_tracks_idx  
(cost=0.00..532.09 rows=18665 width=0) (actual time=0.492..0.492 rows=811 loops=1)
                                       Index Cond: (tstitle @@ 
'''education'''::tsquery)
                           ->  Hash  (cost=49111.33..49111.33 rows=1344433 
width=46) (actual time=36211.598..36211.598 rows=1344433 loops=1)
                                 ->  Seq Scan on cd  (cost=0.00..49111.33 
rows=1344433 width=46) (actual time=31.094..19813.716 rows=1344433 loops=1)
   ->  Index Scan using cdid_tracks_idx on tracks  (cost=0.00..358.08 rows=1197 
width=27) (actual time=31.294..31.527 rows=14 loops=10)
         Index Cond: (public.tracks.cd_id = cd.cd_id)
 Total runtime: 37614.523 ms
(16 rows)

cddb=# EXPLAIN ANALYSE SELECT cd.cd_id,cd.artist,cd.title,tracks.title FROM 
tracks JOIN (SELECT cd.cd_id,cd.artist,cd.title FROM cd WHERE EXISTS (SELECT 1 
FROM tracks WHERE tracks.cd_id = cd.cd_id AND tracks.tstitle @@ 
plainto_tsquery('simple','education')) LIMIT 10) as cd USING (cd_id);

QUERY PLAN--------------------------------------------------------------------------------------------------------------------------------------------------

 Nested Loop  (cost=0.00..10023.37 rows=11974 width=91) (actual 
time=126.607..40853.563 rows=148 loops=1)
   ->  Limit  (cost=0.00..6292.89 rows=10 width=46) (actual 
time=126.587..40853.072 rows=10 loops=1)
         ->  Seq Scan on cd  (cost=0.00..423018283.46 rows=672216 width=46) 
(actual time=126.584..40853.035 rows=10 loops=1)
               Filter: (subplan)
               SubPlan
                 ->  Index Scan using cdid_tracks_idx on tracks  
(cost=0.00..314.61 rows=1 width=0) (actual time=1.025..1.025 rows=0 loops=39706)
                       Index Cond: (cd_id = $0)
                       Filter: (tstitle @@ '''education'''::tsquery)
   ->  Index Scan using cdid_tracks_idx on tracks  (cost=0.00..358.08 rows=1197 
width=27) (actual time=0.011..0.029 rows=15 loops=10)
         Index Cond: (tracks.cd_id = cd.cd_id)
 Total runtime: 40853.789 ms
(11 rows)
#v-

Rephrasing the EXISTS-query as an IN-query did not help the
performance, either.

I get the impression, that I am blind and cannot find the obvious
solution, do you have any idea how to accomplish, what I am trying?

Best Regards,

Tilo

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster


---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

              http://archives.postgresql.org

Re: [PERFORM] fast DISTINCT or EXIST

Reply via email to