Re: [PERFORM] tsearch2/GIST performance factors?

Oleg Bartunov Mon, 17 Oct 2005 11:08:32 -0700

On Sat, 15 Oct 2005, Craig A. James wrote:

We are indexing about 5 million small documents using tsearch2/GIST. Each"document" contains 2 to 50 words. This is a "write once, read many"situation. Write performance is unimportant, and the database contents arestatic. (We build it offline.)
We're having problems with inconsistent performance, and it's very hard toseparate the effects of various factors. Here are the things we think may berelevant.
1. Total number of words
Our documents currently contain about 110,000 unique words. Oleg wrote:"[The limit is] 100K, but it's very fuzzy limit." By trial and error, we'velearned that 50,000 works well, and 150,000 works poorly, so Oleg's commentappears to be a good rule-of-thumb. (With SIGLENINT enlarged, see below.)But there may be other factors that affect this conclusion (such as sharedmemory, total memory, etc.).


Did you consider *decreasing* SIGLENINT ? Size of index will diminish
and performance could be increased. I use in current project SIGLENINT=15

2. Total size of the table
5 million documents is not a very big database (each document is a few to afew hundred bytes), so we don't think this is relevant.
3. Number of documents per word
There seems to be a VERY strong effect related to "common" words. When aword occurs in more than about 1% of the documents (say 50,000 to 150,000documents), performance goes WAY down. Not just for that specific query, butit screws up tsearch2/GIST completely.
We have a test of 100 queries that return 382,000 documents total. The firsttime we run it, it's slow, about 20 minutes (as expected). The second timewe run it, it's very fast, about 72 seconds -- very fast!! As long as weavoid queries with common words, performance is very good.
But, if we run just one query that contains a common word (a word that's inmore than about 2% of the documents, roughly 150,000 documents), then thenext time we run the 100 test queries, it will take 20 minutes again.

We can't simply eliminate these common words. First of all, they can be verysignificant. Second, it doesn't seem like 2% is "common". I can understandthat a words like "the" which occur in most documents shouldn't be indexed.But a word that occurs in 2% of the database seems like a very good word toindex, yet it causes us great problems.


tsearch2's index is a lossy index, read 
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
so search results  should be rechecked !

I've read a bit about tsearchd, and wonder if it would solve our problem.For our application, consistent performance is VERY important. If we couldlock the GIST index into memory, I think it would fix our problem.


I think so, tsearchd was designed for static contents in mind and it's
index doesn't require rechecking !

I tried copying the GIST indexes (which are in a separate tablespace) to a 1GB RAM disk, and it made the initial query faster, but overall performanceseemed worse, probably because the RAM disk was using memory that could havebeen used by the file-system cache.
4. Available RAM and Disk drives
Would more RAM help? How would we tell Postgres to use it effectively? TheGIST indexes are currently about 2.6 GB on the disk.

try to decrease signature size, say,#define SIGLENINT 15

I feel like I'm shooting in the dark -- Linux, Postgres and tsearch2/GIST areinteracting in ways that I can't predict or analyze. Thanks very much forany comments and advice.


We have our TODO http://www.sai.msu.su/~megera/oddmuse/index.cgi/todo
and hope to find sponsorhips for fts project for 8.2 release.
Unfortunately, I didn't find spare time to package tsearchd for you,
it should certainly help you.


        Regards,
                Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
      choose an index scan if your joining column's datatypes do not
      match

Re: [PERFORM] tsearch2/GIST performance factors?

Reply via email to