where's the second option?

about your construction:
> We populate 2 tables:
> one containing the documents (text and ID)
> one containing all the words and the documents ID containing each word

how you plan to make 1 table with words, and several (unlimited)
document_ids?

what you'll need is:
1 table with doc_ids (and perhaps document)
1 table with words
1 table which links words to docs
1 table which gives the position of a word in a doc.

create table documents (doc_id integer primary key auto_increment, document
text);
create table words (word_id integer primary key auto_increment, word
varchar(255));
create table occurences (occ_id integer primary key auto_increment, doc_id
integer, word_id integer);
create table positions (pos_id integer primary key auto_increment, occ_id
integer, position integer);

this way you can handle "unlimited" words with "unlimited" occurences in
"unlimited" documents.

any other solution would force you to construct very inefficient tables, or
use of blob fields which really horribly would slow down your db when adding
data for example, and is generally a very very bad way you shouldn't even
think of.

More tables may look like more programming, but you'll notice things are
easier since this is a correct solution.

i'd suggest you read some stuff about normalisation though! get a nice
(my)sql book, or search for online docs (loads of references can be found in
this mailing list for example!)

btw: your email address looks like "postmaster", is this a bug in the lists
or are you subscribed as postmaster?

regards,

rene


----- Original Message -----
From: "Cedric Veilleux" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, March 23, 2001 4:30 AM
Subject: Large search engine


> Hi,
>
>   I am planning a very large search engine. I've spent some time reading
> the archive and I found some suggestions on how to do this. The word
> indexing method is a very interesting alternative to slow "...where like
> '%foo%';" queries.
>
>   There is from 100k to 500k documents to index, each are about 10
> KBytes large. Plain text.
>
>   The search engine will allow complex boolean queries (AND, OR, NEAR,
> NOT).
>
>   I have 2 plans in mind, I'd like to have opinions on what's would be
> the most efficient.
>
> First Way:
> We populate 2 tables:
> one containing the documents (text and ID)
> one containing all the words and the documents ID containing each word
>
> The idea is to first filter the documents and then to perform a query in
> the documents that contains at least one of the words, so we're supposed
> to get a decent speed.
>
> I know this is used by many people and I know it gives good results,
> even when searching through 100k+ documents. Although, I am wondering if
> there is not a way to do it without any use of LIKE statements. I really
> don't know if what I have in mind is a good idea, it may be completely
> stupid and inefficient, I have very little DB experiences.
>
> Anyways, what if in the table containing the words and the matching
> document ID's, we also specify where in each documents the word is
> located.
>
> ex:
> WORD|   DOCS     |   LOCATIONS
> sun | 32;45;1302 | 3 ; 554,1022 ; 76,675,3445
>
> So word sun is the third word of doc 32, the 554th and 1022th word doc
> 45, etc..
>
> Then the search script will do all the job without sending any other
> queries. May get quite complicated but it should work, it may also be
> easier to process sun NEAR star (maybe this is easy to do with LIKE too,
> I don't know, but I saw nothing in the docs.)
>
>
> Thank you,
>
> Cedric Veilleux
>
>
> ---------------------------------------------------------------------
> Before posting, please check:
>    http://www.mysql.com/manual.php   (the manual)
>    http://lists.mysql.com/           (the list archive)
>
> To request this thread, e-mail <[EMAIL PROTECTED]>
> To unsubscribe, e-mail <[EMAIL PROTECTED]>
> Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php
>
>


---------------------------------------------------------------------
Before posting, please check:
   http://www.mysql.com/manual.php   (the manual)
   http://lists.mysql.com/           (the list archive)

To request this thread, e-mail <[EMAIL PROTECTED]>
To unsubscribe, e-mail <[EMAIL PROTECTED]>
Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php

Reply via email to