Re: [PERFORM] Tsearch2 performance on big database

Oleg Bartunov Wed, 23 Mar 2005 01:43:12 -0800

On Wed, 23 Mar 2005, Rick Jansen wrote:

Oleg Bartunov wrote:
On Tue, 22 Mar 2005, Rick Jansen wrote:
Hmm, the default configuration is too eager: you index every lexeme using the simple dictionary! That is probably too much. Here is what I have for my Russian configuration in the dictionary database:
 default_russian | lword        | {en_ispell,en_stem}
 default_russian | lpart_hword  | {en_ispell,en_stem}
 default_russian | lhword       | {en_ispell,en_stem}
 default_russian | nlword       | {ru_ispell,ru_stem}
 default_russian | nlpart_hword | {ru_ispell,ru_stem}
 default_russian | nlhword      | {ru_ispell,ru_stem}
Notice that I index only Russian and English words, no numbers, URLs, etc.
You may just delete the unwanted rows in pg_ts_cfgmap for your configuration,
but I'd recommend updating them instead, setting dict_name to NULL.
For example, to stop indexing integers:
update pg_ts_cfgmap set dict_name=NULL where ts_name='default_russian' and tok_alias='int';

voc=# select token,dict_name,tok_type,tsvector from ts_debug('Do you have +70000 bucks');
 token  |      dict_name      | tok_type | tsvector
--------+---------------------+----------+----------
 Do     | {en_ispell,en_stem} | lword    |
 you    | {en_ispell,en_stem} | lword    |
 have   | {en_ispell,en_stem} | lword    |
 +70000 |                     | int      |
 bucks  | {en_ispell,en_stem} | lword    | 'buck'
Only 'bucks' gets indexed :)
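
The same approach extends to several token types at once. A minimal sketch, assuming the configuration is named 'default_russian' as above; the list of aliases is only an illustration and should be checked against what your parser actually reports before running it:

```sql
-- Stop indexing numeric and URL-like tokens in one statement.
-- The alias list is an assumption; adjust it to your parser's token types.
update pg_ts_cfgmap
   set dict_name = NULL
 where ts_name = 'default_russian'
   and tok_alias in ('int', 'uint', 'float', 'sfloat', 'url', 'host', 'email');

-- Review which token types still have dictionaries assigned.
select tok_alias, dict_name
  from pg_ts_cfgmap
 where ts_name = 'default_russian'
   and dict_name is not null
 order by tok_alias;
```

Setting dict_name to NULL rather than deleting the rows keeps the mapping visible, so a token type can be re-enabled later with another update.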
Hmm, I should probably add this to the documentation.
What about word statistics (the number of unique words, for example)?
I'm now following the guide to add the ispell dictionary and I've updated most of the rows setting dict_name to NULL:
     ts_name     |  tok_alias   | dict_name
-----------------+--------------+-----------
 default         | lword        | {en_stem}
 default         | nlword       | {simple}
 default         | word         | {simple}
 default         | part_hword   | {simple}
 default         | nlpart_hword | {simple}
 default         | lpart_hword  | {en_stem}
 default         | hword        | {simple}
 default         | lhword       | {en_stem}
 default         | nlhword      | {simple}
These are the rows that are left, but I have no idea what 'hword', 'nlhword' or any of the other token types are.


From my notes
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes
I've been asked how to find out which token types a parser supports. Actually, there is
a function token_type(parser), so you can just use:

        select * from token_type();
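
As a rough sketch of how this answers the 'hword' question: the parser can also be named explicitly, and the result filtered down to the aliases of interest. The parser name 'default' and the column names alias and descr are assumptions based on a stock tsearch2 installation; check the output of the plain call above if yours differ.

```sql
-- Look up the descriptions of the hyphenated-word token types.
-- 'default' (parser name) and the alias/descr columns are assumed here.
select alias, descr
  from token_type('default')
 where alias like '%hword'
 order by alias;
```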

Anyway, how do I find out the number of unique words or other word statistics?

from my notes http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes

It's useful to look at word statistics, for example, to check how well your dictionaries work or how you have configured pg_ts_cfgmap. You may also notice likely stop words relevant to your collection. Tsearch provides the stat() function:

.......................
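
For the question about unique words, a minimal sketch of how stat() might be used. The table name documents and the tsvector column fts_index are hypothetical placeholders, and the result columns word, ndoc and nentry are what I would expect from tsearch2's stat(); the notes linked above are the authoritative source.

```sql
-- Number of unique words in the indexed collection.
-- 'documents' and 'fts_index' are placeholder names.
select count(*)
  from stat('select fts_index from documents');

-- The ten most widespread words: likely stop-word candidates.
select word, ndoc, nentry
  from stat('select fts_index from documents')
 order by ndoc desc, nentry desc, word
 limit 10;
```

Here ndoc would be the number of documents containing the word and nentry the total number of occurrences.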

Don't hesitate to read it, and if you find any bugs or know a better wording,
I'd be glad to improve my notes.


Rick


    Regards,
                Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

