At our website we had a "in database" search as well... It was terribly slow (it was a custom built vector space model implemented in mysql+php so that explains a bit).
We replaced it with the Xapian library (www.xapian.org), using its Omega frontend as a middle layer. That is, our PHP scripts call the Omega search frontend and postprocess the results (some rights double-checks and so on), and from the results we build a very simple
$sql = "SELECT ... FROM documents ... WHERE docid IN (" . implode(',', $docids_array) . ")";
(you understand enough PHP to follow this, I suppose)
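In slightly more detail, the glue looks something like the sketch below. It's only a sketch: the omega path, the FMT=docids template and the table columns are made up, it assumes an already-open MySQL connection, and it runs omega directly instead of talking to it through xinetd like we really do. The rights checks live in the postprocessing loop.

  <?php
  // Sketch only: run the Omega CGI by hand. As a CGI it reads its
  // input from the environment; P carries the query, FMT picks a
  // template (here a hypothetical one printing one docid per line).
  $q   = isset($_GET['q']) ? $_GET['q'] : '';
  $env = 'REQUEST_METHOD=GET QUERY_STRING='
       . escapeshellarg('P=' . urlencode($q) . '&FMT=docids');
  exec($env . ' /usr/lib/cgi-bin/omega', $lines);

  // Omega prints CGI headers first; keeping only the numeric lines
  // leaves us with the docids.
  $docids = array_filter($lines, 'is_numeric');

  if (count($docids) > 0) {
      // One simple IN query fetches all matching documents at once
      // (assumes a MySQL connection is already open).
      $sql = 'SELECT docid, title, body FROM documents'
           . ' WHERE docid IN (' . implode(',', $docids) . ')';
      $res = mysql_query($sql);
      while ($row = mysql_fetch_assoc($res)) {
          // rights double-checks and rendering happen here
      }
  }
  ?>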
With our 10GB of text, we have a 14GB (uncompressed; 9GB or so compressed) Xapian database (the largest part is the 6.7GB positional table). I'm pretty sure that if we'd store that information in something like tsearch it'd be more than that 14GB...
Searches take less than a second (unless you do phrase searches, of course; those take a few seconds and sometimes a few minutes).
I did a query on 'ext3 undelete' just a few minutes ago, and it searched the 827,150 documents in only 0.027 seconds (0.006 on a second run); ext3 was found in 753 documents and undelete in 360. That excludes the result parsing, of course: the total time to create the webpage was "much" longer (0.43 seconds or so), because the results need to be transferred via xinetd and then extracted from MySQL (which is terrible with the "search supporting queries" we issue :/ ). Our search machine is very similar to the machine you use as your database server, but it doesn't do much heavy work apart from running the Xapian/Omega search combination.
If you are interested in this, I can provide (much) more information about our implementation. Since you don't need rights checks, you could even get away with just the Omega frontend all by itself (it has a nice scripting language, but it can't interface with anything but Xapian).
The main advantage of taking this out of your SQL database is that it runs on its own custom-built storage system (and you can offload it to another machine, like we did).
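Offloading it is hardly any work, by the way; the search box just runs omega under xinetd. Something like the stanza below is all there is to it (the service name, port and wrapper path are made up; the wrapper only sets up the CGI environment and execs omega):

  service omega-search
  {
          type            = UNLISTED
          port            = 8888
          socket_type     = stream
          protocol        = tcp
          wait            = no
          user            = nobody
          server          = /usr/local/bin/omega-wrapper
          only_from       = 192.168.0.0/24
          disable         = no
  }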
Btw, if you really need an "in database" solution, read back the posting by Eric Ridge of 26-12-2003 20:54 on the hackers list (he's working on integrating Xapian into PostgreSQL as an FTI).
Best regards,
Arjen van der Meijden
Marc G. Fournier wrote:
Does anyone know anything better than mnogosearch that works with PostgreSQL for doing indexing? The database server is a dual Xeon 2.4G, 4G of RAM, and a load avg right now of a lowly 1.5 ... the file system is 3x72G drives in a RAID5 configuration, and the database server is 7.4 ... The mnogosearch folk use MySQL for their development, so it's possible they are doing something to compensate for a fault in MySQL that is slowing this process down, but this is ridiculous ...
Note that I have it set up with what the mnogosearch folk list as being 'the fastest schema for large indexes', namely 'crc-multi' ...
Right now, we're running only 373k docs:
isvr5# indexer -S
Database statistics
 Status  Expired   Total
 -----------------------------
    415        0     311  Unsupported Media Type
    302        0    1171  Moved Temporarily
    502        0      43  Bad Gateway
    414        0       3  Request-URI Too Long
    301        0     307  Moved Permanently
    404        0    1960  Not found
    410        0       1  Gone
    401        0      51  Unauthorized
    304        0   16591  Not Modified
    200        0  373015  OK
    504        0      48  Gateway Timeout
    400        0       3  Bad Request
      0        2      47  Not indexed yet
 -----------------------------
  Total        2  393551
and a vacuum analyze runs nightly ...
Anyone with suggestions/ideas? It has to be something client/server, like mnogosearch, as we're dealing with multiple servers searching against the same database ... so I don't *think* that ht://Dig is a solution, but I may be wrong there ...