--- Hervé Piedvache <[EMAIL PROTECTED]> wrote:

> George,
> I have well read many pages about this subject ... but I have not found any 
> thing for the moment to really help me ...
> What can I do to optimize my PostgreSQL configuration for a special use of 
> Tsearch2 ...
> I'm a little disappointed seeing that the PostgreSQL Russian search engine using 
> Tsearch2 is really quick ... why can't I have the same result with a 
> bi-Pentium III 933 and 1 GB of RAM with the text indexation of 1,500,000 
> records ?
> Regards,
> -- 
> Hervé Piedvache
> Elma Ingénierie Informatique
> 6 rue du Faubourg Saint-Honoré - F-75008 - Paris - France
> Pho. 33-144949901
> Fax. 33-144949902

Tsearch does not scale indefinitely.  It was designed for fast online updates and to 
be integrated into PostgreSQL.  My understanding is that it uses a bloom filter together 
with bit signatures.  Typically, full text searches use inverted indexes, which scale 
better but are slower to update.
My understanding is that tsearch has a practical limit of 100,000 distinct word stems 
or lexemes. 
Note that word stems are not words.  Word stems are what are actually stored in a 
tsvector after
parsing and dictionary processing.
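As a quick illustration of stems versus words, you can run to_tsvector by hand and look 
at the output (I'm using an 'english' configuration name here as in current PostgreSQL; 
in the contrib tsearch2 module the default configuration was simply named 'default'):

```sql
-- Several inflected forms of the same word are reduced to a single
-- stem by the dictionary, so only one entry is stored in the tsvector:
SELECT to_tsvector('english', 'stores storing stored');
-- all three words collapse to the one stem 'store', with three positions
```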

The key to making tsearch fast is to keep the number of word stems low.  You decrease 
the number
of word stems by using stop words, various dictionaries, synonyms, and preprocessing 
text before
it gets to tsearch.  You can find what word stems are stored in a tsvector column by 
using the
stat function.  For examples of how to use the stat function, see:


Note that the stat function will take a long time to run on large tables.
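A minimal sketch of using stat (column and table names are placeholders for your own; 
in later PostgreSQL releases the built-in equivalent is ts_stat):

```sql
-- stat() takes a query string returning a tsvector column and yields
-- one row per distinct stem, with document and entry counts.
SELECT word, ndoc, nentry
FROM stat('SELECT vector_col FROM my_table')
ORDER BY ndoc DESC
LIMIT 20;

-- Total number of distinct stems, the figure to watch while tuning:
SELECT count(*) FROM stat('SELECT vector_col FROM my_table');
```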

Performance tuning must be done on a case by case basis.  It can take some time to try 
things and see the change in performance.  Each time you try something new, use the 
stat function
to see how the number of word stems has changed.

The largest project I used tsearch2 on contained 900,000 records.  Before performance 
tuning there were 275,000 distinct word stems.  After performance tuning, I got it down 
to a small fraction of that.

By using the stat function, I noticed some obvious stop words that were very frequent 
and that nobody would ever search for.  For how to use stop words, see:


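You can see the effect of a stop word list directly; words on the list never make it 
into the tsvector at all (again assuming an 'english' configuration; the list itself 
lives in the dictionary's stop word file, e.g. english.stop):

```sql
-- 'the' and 'of' are on the English stop list, so only the stems of
-- 'history' and 'database' are stored:
SELECT to_tsvector('english', 'the history of the database');
```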
Also I noticed some strange patterns by looking through all of the word stems.  

In one case, strings of 3 to 7 words were joined together with hyphens to indicate 
nesting.  Tsearch would store these long hyphenated words intact and also store the 
stem of each
individual word.  I made a judgment call that no one would ever search for the long 
words, so I preprocessed the text to remove the hyphens. 
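One simple way to do that kind of preprocessing in SQL, assuming the raw text lives in 
a column (doc_text, vector_col and my_table are placeholder names, not from the original 
post):

```sql
-- Replace hyphens with spaces before parsing, so each component word
-- is stemmed individually and the long joined token is never stored.
UPDATE my_table
SET vector_col = to_tsvector('english', replace(doc_text, '-', ' '));
```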

I also noticed that many of the word stems were alphanumeric IDs that were designed to 
be unique. 
There were many of these IDs in the tsvector column although each ID would occur only 
once or
twice.  I again preprocessed the text to remove these IDs, but created a btree index 
on a varchar
column representing the IDs.  My search form allows users to either search full text 
using tsearch2 or search IDs using 'LIKE' queries, which use the btree index.  For 
'LIKE' queries, it was another matter to get postgres to use the btree index and not 
a sequential scan.  For this, see:


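The usual fix, when the database is not in the C locale, is to build the index with the 
pattern operator class so that left-anchored LIKE patterns can use it (table and column 
names below are placeholders):

```sql
-- varchar_pattern_ops makes the btree index usable for LIKE 'prefix%'
-- comparisons even in a non-C locale.
CREATE INDEX my_table_id_idx
ON my_table (id_col varchar_pattern_ops);

-- A left-anchored pattern can then use the index instead of a seq scan:
SELECT * FROM my_table WHERE id_col LIKE 'AB123%';
```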
Last, I noticed that most users wanted to restrict the full text search to a subset 
determined by
another column in the table.  As a result, I created a multicolumn gist index on an 
integer column
and a tsvector column.  For how to setup a multicolumn gist index, see:


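A sketch of what that index and query look like; the integer column needs the btree_gist 
contrib module so it can participate in a GiST index, and the names here are placeholders:

```sql
-- btree_gist supplies GiST operator classes for plain scalar types,
-- letting an integer column share one GiST index with a tsvector.
CREATE INDEX my_table_cat_vector_idx
ON my_table USING gist (category_id, vector_col);

-- Both the restriction and the full text match can then be answered
-- from the one index:
SELECT * FROM my_table
WHERE category_id = 42
  AND vector_col @@ to_tsquery('english', 'example');
```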
There are no easy answers.  Like I said, performance tuning must be done on a case by 
case basis.

Hope this helps,

George Essig

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
