We are indexing about 5 million small documents using tsearch2/GIST.  Each "document" 
contains 2 to 50 words.  This is a "write once, read many" situation.  Write performance 
is unimportant, and the database contents are static.  (We build it offline.)

We're having problems with inconsistent performance, and it's very hard to 
separate the effects of various factors.  Here are the things we think may be 
relevant.

1. Total number of words

Our documents currently contain about 110,000 unique words.  Oleg wrote: "[The limit 
is] 100K, but it's very fuzzy limit."  By trial and error, we've learned that 50,000 
words works well and 150,000 works poorly, so Oleg's comment appears to be a good 
rule of thumb.  (This is with SIGLENINT enlarged, as shown in the configuration 
below.)  But there may be other factors that affect this conclusion, such as shared 
memory, total memory, and so on.
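
For reference, the unique-word count can be taken straight from tsearch2's stat() 
function.  Something like this, where "docs" and "vector" stand in for our real 
table and tsvector column:

  -- Count unique lexemes across the whole table ("docs" and "vector" are
  -- placeholders for our actual table and tsvector column).
  SELECT count(*) FROM stat('SELECT vector FROM docs');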


2. Total size of the table

8 million documents is not a very big database (each document is a few to a few 
hundred bytes), so we don't think this is relevant.
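
For scale, the table and index sizes can be checked with queries like these (this 
assumes PostgreSQL 8.1 or later; the relation names are placeholders for ours):

  -- On-disk size of the table and of one GIST index (names are placeholders).
  SELECT pg_size_pretty(pg_relation_size('docs'));
  SELECT pg_size_pretty(pg_relation_size('docs_vector_gist_idx'));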


3. Number of documents per word

There seems to be a VERY strong effect related to "common" words.  When a word 
occurs in more than about 1% of the documents (say 50,000 to 150,000 documents), 
performance goes WAY down.  Not just for that specific query; it drags down 
subsequent tsearch2/GIST queries across the board.

We have a test suite of 100 queries that together return 382,000 documents.  The 
first time we run it, it's slow, about 20 minutes (as expected).  The second time 
we run it, it takes about 72 seconds.  As long as we avoid queries with common 
words, performance stays at that fast level.

But if we run just one query that contains a common word (one that appears in more 
than about 2% of the documents, roughly 150,000), then the next time we run the 100 
test queries, it takes 20 minutes again.
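
For concreteness, the queries all have this general shape; a single one of them 
with a common search word is enough to knock us out of the fast state.  The table, 
column, and word below are placeholders for ours:

  -- A typical query, timed with EXPLAIN ANALYZE; all names are placeholders.
  EXPLAIN ANALYZE
  SELECT docid
  FROM   docs
  WHERE  vector @@ to_tsquery('someword');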

We can't simply eliminate these common words.  First, they can be very significant 
to our searches.  Second, 2% doesn't seem all that "common": I can understand that 
a word like "the", which occurs in most documents, shouldn't be indexed, but a word 
that occurs in 2% of the documents seems like a very good word to index, yet it 
causes us great problems.
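
The same stat() function shown above makes it easy to see which words cross that 
threshold.  For example (placeholder names again, with 100,000 being roughly 2% of 
5 million documents):

  -- Words that appear in more than ~2% of the documents.
  SELECT word, ndoc
  FROM   stat('SELECT vector FROM docs')
  WHERE  ndoc > 100000
  ORDER  BY ndoc DESC;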

I've read a bit about tsearchd, and wonder if it would solve our problem.  For 
our application, consistent performance is VERY important.  If we could lock 
the GIST index into memory, I think it would fix our problem.

I tried copying the GIST indexes (which are in a separate tablespace) to a 1 GB 
RAM disk, and it made the initial query faster, but overall performance seemed 
worse, probably because the RAM disk was using memory that could have been used 
by the file-system cache.


4. Available RAM and Disk drives

Would more RAM help?  How would we tell Postgres to use it effectively?  The 
GIST indexes are currently about 2.6 GB on the disk.

Would more disks help?  I know they would make the initial pass faster -- the 
20-minute initial query would be reduced with a RAID array, etc.  But I'm not 
concerned about the 20-minute initial query; I'm concerned about keeping the system 
in that super-fast state where the GIST indexes are all in memory.
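
If the answer is simply to give Postgres more of the RAM directly, I assume it 
would look something like this (the numbers are guesses on my part, and the whole 
shared-memory segment has to fit under shmmax, currently 2 GB for us):

  shared_buffers = 200000         # ~1.6 GB of 8 KB buffers, to hold more of the GIST index
  effective_cache_size = 300000   # ~2.3 GB; roughly right for a 4 GB machine

But I don't know whether a larger shared_buffers actually keeps the GIST pages 
resident any better than the kernel's file-system cache does.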


Hardware:
  Dual-CPU Xeon Dell server with 4 GB memory and a single SATA 7200 RPM 150GB 
disk.

tsearch2/gistidx.h
  modified as: #define SIGLENINT  120
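  (If I'm reading gistidx.h right, SIGLEN is sizeof(int4) * SIGLENINT, so 120 gives 
  a 480-byte signature per index key; a longer signature means fewer false matches 
  with a large lexicon, at the cost of a bigger index.)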

System configuration:
  echo 2147483648 >/proc/sys/kernel/shmmax   # max shared-memory segment size, bytes (2 GB)
  echo 4096       >/proc/sys/kernel/shmmni   # max number of shared-memory segments
  echo 2097152    >/proc/sys/kernel/shmall   # total shared memory, in 4 KB pages (8 GB)

Postgres Configuration:
  shared_buffers = 20000          # 8 KB buffers, about 160 MB
  work_mem = 32768                # KB, so 32 MB per sort/hash
  effective_cache_size = 300000   # 8 KB pages, about 2.3 GB

Thanks very much for any comments and advice.

Craig


