Re: [PERFORM] TSearch2 and optimisation ...

George Essig Sun, 12 Sep 2004 20:04:39 -0700

--- Herv�<inputPiedvache <[EMAIL PROTECTED]> wrote:

> George,
> 
> I have well read many pages about this subject ... but I have not found any 
> thing for the moment to really help me ...
> What can I do to optimize my PostgreSQL configuration for a special use of 
> Tsearch2 ...
> I'm a little dispointed looking the Postgresql Russian search engine using 
> Tsearch2 is really quick ... why I can't haev the same result with a 
> bi-pentium III 933 and 1Gb of RAM with the text indexation of 1 500 000 
> records ?
> 
> Regards,
> -- 
> Herv�<inputPiedvache
> 
> Elma Ing�<inputierie Informatique
> 6 rue du Faubourg Saint-Honor�<input> F-75008 - Paris - France
> Pho. 33-144949901
> Fax. 33-144949902
>

Tsearch does not scale indefinitely. It was designed for fast online updates and to
be integrated
into PostgreSQL. My understanding is that it uses a bloom filter together with bit
string
signatures. Typically, full text searches use inverted indexes, scale better, but are
slower to
update.

My understanding is that tsearch has a practical limit of 100,000 distinct word stems
or lexemes.
Note that word stems are not words. Word stems are what are actually stored in a
tsvector after
parsing and dictionary processing.

The key to making tsearch fast is to keep the number of word stems low. You decrease
the number
of word stems by using stop words, various dictionaries, synonyms, and preprocessing
text before
it gets to tsearch. You can find what word stems are stored in a tsvector column by
using the
stat function. For examples of how to use the stat function, see:

http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/stat.html

Note that the stat function will take a long time to run on large tables.

Performance tuning must be done on a case by case basis. It can take some time to try
different
things and see the change in performance. Each time you try something new, use the
stat function
to see how the number of word stems has changed.

The largest project I used tsearch2 on contained 900,000 records. Without performance
tuning,
there were 275,000 distinct word stems. After performance tuning, I got it down to
14,000
distinct word stems.

By using the stat function, I noticed some obvious stop words that were very frequent
that nobody
would ever search for. For how to use stop words, see:

http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/stop_words.html

Also I noticed some strange patterns by looking through all of the word stems.

In one case, strings of 3 to 7 words were joined together with hyphens to indicate
category
nesting. Tsearch would store these long hyphenated words intact and also store the
stem of each
individual word. I made a judgment call that no one would ever search for the long
hyphenated
words, so I preprocessed the text to remove the hyphens.

I also noticed that many of the word stems were alphanumeric IDs that were designed to
be unique.
There were many of these IDs in the tsvector column although each ID would occur only
once or
twice. I again preprocessed the text to remove these IDs, but created a btree index
on a varchar
column representing the IDs. My search form allows users to either search full text
using
tsearch2 or search IDs using 'LIKE' queries which use a btree index. For 'LIKE'
queries, it was
another matter to get postgres to use the btree index and not use a sequential scan.
For this,
see:

http://www.postgresql.org/docs/7.4/static/indexes-opclass.html

Last, I noticed that most users wanted to restrict the full text search to a subset
determined by
another column in the table. As a result, I created a multicolumn gist index on an
integer column
and a tsvector column. For how to setup a multicolumn gist index, see:

http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/multi_column_index.html

There are no easy answers. Like I said, performance tuning must be done on a case by
case basis.

Hope this helps,

George Essig

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [PERFORM] TSearch2 and optimisation ...

Reply via email to