Marc,

At our website we had a "in database" search as well... It was terribly slow (it was a custom built vector space model implemented in mysql+php so that explains a bit).

We replaced it by the Xapian library (www.xapian.org) with its Omega frontend as a middle end. I.e. we call with our php-scripts the omega search frontend and postprocess the results with the scripts (some rights double checks and so on), from the results we build a very simpel
SELECT ... FROM documents ... WHERE docid IN implode($docids_array)
(you understand enough php to understand this, I suppose)


With our 10GB of tekst, we have a 14GB (uncompressed, 9G compressed orso) xapian database (the largest part is for the 6.7G positional table), I'm pretty sure that if we'd store that information in something like tsearch it'd be more than that 14GB...

Searches take less than a second (unless you do phrase searches of course, that takes a few seconds and sometimes a few minutes).

I did a query on 'ext3 undelete' just a few minutes ago and it did the search in 827150 documents in only 0.027 (a second run 0.006) seconds (ext3 was found in 753 and undelete in 360 documents). Of course that is excluding the results parsing, the total time to create the webpage was "much" longer (0.43 seconds orso) due to the fact that the results needs to be transferred via xinetd and the results needs to be extracted from mysql (which is terrible with the "search supporting queries" we issue :/ ) Our search machine is very similar the machine you use as database, but it doesn't do much heavy work apart from running the xapian/omega search combination.

If you are interested in this, I can provide (much) more information about our implementation. Since you don't need right-checks, you could even get away with just the omega front end all by itself (it has a nice scripting language, but can't interface with anything but xapian).

The main advantage of taking this out of your sql database is that it runs on its own custom built storage system (and you could offload it to another machine, like we did).
Btw, if you really need an "in database" solution, read back the postings of Eric Ridge at 26-12-2003 20:54 on the hackers list (he's working on integrating xapian in postgresql as a FTI)


Best regards,

Arjen van der Meijden


Marc G. Fournier wrote:
does anyone know anything better then mnogosearch, that works with
PostgreSQL, for doing indexing?  the database server is a Dual Xeon 2.4G,
4G of RAM, and a load avg right now of a lowly 1.5 ... the file system is
3x72G drive in a RAID5 configuration, and the database server is 7.4 ...
the mnogosearch folk use mysql for their development, so its possible
there is something they are doing that is slowing this process down, to
compensate for a fault in mysql, but this is ridiculous ...

note that I have it setup with what the mnogosearch folk lists as being
'the fastest schema for large indexes' or 'crc-multi' ...

right now, we're running only 373k docs:

isvr5# indexer -S

Database statistics

    Status    Expired      Total
   -----------------------------
       415          0        311 Unsupported Media Type
       302          0       1171 Moved Temporarily
       502          0         43 Bad Gateway
       414          0          3 Request-URI Too Long
       301          0        307 Moved Permanently
       404          0       1960 Not found
       410          0          1 Gone
       401          0         51 Unauthorized
       304          0      16591 Not Modified
       200          0     373015 OK
       504          0         48 Gateway Timeout
       400          0          3 Bad Request
         0          2         47 Not indexed yet
   -----------------------------
     Total          2     393551

and a vacuum analyze runs nightly ...

anyone with suggestions/ideas?  has to be something client/server, like
mnogosearch, as we're dealing with multiple servers searching against the
same database ... so I don't *think* that ht/Dig is a solution, but may be
wrong there ...



---------------------------(end of broadcast)--------------------------- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match

Reply via email to