First let me explain the situation:

I came into the #postgresql irc channel talking about this problem, and someone advised me to use this mailing list (so i'm not just wasting your time i hope). I'm not sure how to describe the problem fully, so I'll start by explaining what my database does, a little about the table structure, then an example of a problematic query, and some other information that might be relevant.

My database is a Chinese-English/English-Chinese dictionary. It lets users search Chinese words by any character in the word, or any sequence of characters (starting anywhere the user wants). Characters are often searched by their pinyin values (a romanization of the sounds). The dictionary has an average word length of 2 (but it sucks), but there are also many words of length 4 and 6 characters. So it wouldn't be uncommon to search for something like "a1 la1 bo yu" (arabic). There are also some very long words with 12 or more characters (where my problem becomes more pronounced).

That being said, the most important table here is the words table:
             Table "public.words"
   Column    |         Type         | Modifiers
--------------+----------------------+-----------
wid          | integer              | not null
sequence     | smallint             | not null
variant      | smallint             | not null
char_count   | smallint             | not null
unicode      | character varying(5) | not null
pinyin       | character varying(8) | not null
simpvar      | character varying(5) |
zvar         | character varying(5) |
compatvar    | character varying(5) |
def_exists   | boolean              | not null
num_variants | smallint             |
page_order   | integer              |
pinyins      | character varying    |
unicodes     | character varying    |
Indexes:
   "words2_pkey" primary key, btree (wid, variant, "sequence")
   "page_index" btree (page_order)
   "pinyin_index" btree (pinyin)
   "unicode_index" btree (unicode)


The best example of the problem I have when using this table is this query:


SELECT
   w8.wid,
   w8.variant,
   w8.num_variants,
   sum_text(w8.unicode) as unicodes,
   sum_text(w8.pinyin) as pinyins
FROM
   words as w0,     words as w1,
   words as w2,     words as w3,
   words as w4,     words as w5,
   words as w6,     words as w7,
   words as w8
WHERE
   w0.wid > 0 AND
   w0.pinyin = 'zheng4' AND
   w0.def_exists = 't' AND
   w0.sequence = 0 AND
   w1.wid = w0.wid AND
   w1.pinyin LIKE 'fu_' AND
   w1.variant = w0.variant AND
   w1.sequence = (w0.sequence + 1) AND
   w2.wid = w1.wid AND
   w2.pinyin LIKE 'ji_' AND
   w2.variant = w1.variant AND
   w2.sequence = (w1.sequence + 1) AND
   w3.wid = w2.wid AND
   w3.pinyin LIKE 'guan_' AND
   w3.variant = w2.variant AND
   w3.sequence = (w2.sequence + 1) AND
   w4.wid = w3.wid AND
   w4.pinyin LIKE 'kai_' AND
   w4.variant = w3.variant AND
   w4.sequence = (w3.sequence + 1) AND
   w5.wid = w4.wid AND
   w5.pinyin LIKE 'fang_' AND
   w5.variant = w4.variant AND
   w5.sequence = (w4.sequence + 1) AND
   w6.wid = w5.wid AND
   w6.pinyin LIKE 'xi_' AND
   w6.variant = w5.variant AND
   w6.sequence = (w5.sequence + 1) AND
   w7.wid = w6.wid AND
   w7.pinyin LIKE 'tong_' AND
   w7.variant = w6.variant AND
   w7.sequence = (w6.sequence + 1) AND
   w8.wid = w7.wid AND
   w8.variant = w7.variant
GROUP BY
   w8.wid,
   w8.variant,
   w8.num_variants,
   w8.page_order ,
   w0.sequence ,
   w1.sequence ,
   w2.sequence ,
   w3.sequence ,
   w4.sequence ,
   w5.sequence ,
   w6.sequence ,
   w7.sequence
ORDER BY
   w8.page_order;

(phew!)

with the default geqo_threshold of 11, this query takes 3155ms on my machine (a 1ghz athlon with 384 megs of pc133 ram). This is very very long.

if i first do prepare blah as SELECT ....., then run execute blah, the time goes down to about 275ms (i had been running this query a lot, and did a vacuum update before all this).

the ouput from EXPLAIN ANALYZE :

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=54.13..54.14 rows=1 width=43) (actual time=315.357..315.357 rows=1 loops=1)
Sort Key: w8.page_order
-> HashAggregate (cost=54.12..54.12 rows=1 width=43) (actual time=315.328..315.330 rows=1 loops=1)
-> Nested Loop (cost=0.00..54.08 rows=1 width=43) (actual time=6.229..314.566 rows=12 loops=1)
Join Filter: (("outer".wid = "inner".wid) AND ("outer".variant = "inner".variant) AND ("outer"."sequence" = ("inner"."sequence" + 1)) AND ("inner"."sequence" = ("outer"."sequence" + 1)))
-> Nested Loop (cost=0.00..48.07 rows=1 width=83) (actual time=6.088..279.745 rows=12 loops=1)
Join Filter: (("inner"."sequence" = ("outer"."sequence" + 1)) AND ("outer"."sequence" = ("inner"."sequence" + 1)))
-> Nested Loop (cost=0.00..42.05 rows=1 width=75) (actual time=5.980..278.602 rows=12 loops=1)
-> Nested Loop (cost=0.00..36.04 rows=1 width=48) (actual time=5.910..278.280 rows=1 loops=1)
Join Filter: (("inner".variant = "outer".variant) AND ("inner".wid = "outer".wid))
-> Nested Loop (cost=0.00..30.04 rows=1 width=40) (actual time=3.465..275.137 rows=1 loops=1)
Join Filter: ("inner"."sequence" = ("outer"."sequence" + 1))
-> Nested Loop (cost=0.00..24.03 rows=1 width=32) (actual time=3.408..275.045 rows=1 loops=1)
Join Filter: ("outer"."sequence" = ("inner"."sequence" + 1))
-> Nested Loop (cost=0.00..18.00 rows=1 width=24) (actual time=3.350..274.948 rows=1 loops=1)
-> Nested Loop (cost=0.00..11.99 rows=1 width=16) (actual time=3.295..274.678 rows=6 loops=1)
Join Filter: (("inner".wid = "outer".wid) AND ("inner".variant = "outer".variant) AND ("inner"."sequence" = ("outer"."sequence" + 1)))
-> Index Scan using pinyin_index on words w4 (cost=0.00..5.98 rows=1 width=8) (actual time=0.090..1.222 rows=165 loops=1)
Index Cond: (((pinyin)::text >= 'kai'::character varying) AND ((pinyin)::text < 'kaj'::character varying))
Filter: ((pinyin)::text ~~ 'kai_'::text)
-> Index Scan using pinyin_index on words w5 (cost=0.00..5.98 rows=1 width=8) (actual time=0.017..1.380 rows=259 loops=165)
Index Cond: (((pinyin)::text >= 'fang'::character varying) AND ((pinyin)::text < 'fanh'::character varying))
Filter: ((pinyin)::text ~~ 'fang_'::text)
-> Index Scan using words2_pkey on words w1 (cost=0.00..6.00 rows=1 width=8) (actual time=0.032..0.037 rows=0 loops=6)
Index Cond: (("outer".wid = w1.wid) AND ("outer".variant = w1.variant))
Filter: ((pinyin)::text ~~ 'fu_'::text)
-> Index Scan using words2_pkey on words w0 (cost=0.00..6.01 rows=1 width=8) (actual time=0.033..0.068 rows=1 loops=1)
Index Cond: (("outer".wid = w0.wid) AND (w0.wid > 0) AND ("outer".variant = w0.variant))
Filter: (((pinyin)::text = 'zheng4'::text) AND (def_exists = true) AND ("sequence" = 0))
-> Index Scan using words2_pkey on words w2 (cost=0.00..6.00 rows=1 width=8) (actual time=0.029..0.060 rows=1 loops=1)
Index Cond: ((w2.wid = "outer".wid) AND (w2.variant = "outer".variant))
Filter: ((pinyin)::text ~~ 'ji_'::text)
-> Index Scan using pinyin_index on words w7 (cost=0.00..5.98 rows=1 width=8) (actual time=0.030..2.573 rows=338 loops=1)
Index Cond: (((pinyin)::text >= 'tong'::character varying) AND ((pinyin)::text < 'tonh'::character varying))
Filter: ((pinyin)::text ~~ 'tong_'::text)
-> Index Scan using words2_pkey on words w8 (cost=0.00..5.99 rows=1 width=27) (actual time=0.029..0.130 rows=12 loops=1)
Index Cond: ((w8.wid = "outer".wid) AND (w8.variant = "outer".variant))
-> Index Scan using words2_pkey on words w6 (cost=0.00..6.00 rows=1 width=8) (actual time=0.040..0.060 rows=1 loops=12)
Index Cond: ((w6.wid = "outer".wid) AND (w6.variant = "outer".variant))
Filter: ((pinyin)::text ~~ 'xi_'::text)
-> Index Scan using pinyin_index on words w3 (cost=0.00..5.98 rows=1 width=8) (actual time=0.023..2.312 rows=304 loops=12)
Index Cond: (((pinyin)::text >= 'guan'::character varying) AND ((pinyin)::text < 'guao'::character varying))
Filter: ((pinyin)::text ~~ 'guan_'::text)
Total runtime: 316.493 ms
(44 rows)


Time: 3167.853 ms



As you can see, the two run times there are quite different... The person I spoke to in the irc channel said this all indicated a poor planning time, and I think I agree. Yesterday I tried setting geqo_threshold to 7 instead of the default of 11, and it seemed to help a little, but the running times were still extremely high.

I guess I do have a question in addition to just wanting to notify the right people of this problem: Since a lot of my queries are similar to this one (but not similar enough to allow me to use one or two of them over and over with different parameters), is there any way for me to reorganize or rewrite the queries so that the planner doesn't take so long? (I would hate to have to take all of this out of the db's hands and iterate in code myself...)

If you guys are optimistic about someone being able to fix this problem in pgsql, I will just wait for the bug fix.


Thanks for listening :) let me know if you need any more information (oh yea, this is on Linux, version 7.4.1)


_________________________________________________________________
Create a Job Alert on MSN Careers and enter for a chance to win $1000! http://msn.careerbuilder.com/promo/kaday.htm?siteid=CBMSN_1K&sc_extcmp=JS_JASweep_MSNHotm2



---------------------------(end of broadcast)--------------------------- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match

Reply via email to