The basis of any fulltext search engine is that you parse the documents
(a document is a 1 -> N relation of doc -> words) and build a so-called
"reverse index" (a 1 -> M relation of word -> docs). Then you can
easily get all the documents that include a given word.
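
To make that concrete, here is a minimal sketch in Python. The names
tokenize and build_reverse_index are made up for illustration, and the
tokenizer (lowercase, ASCII letters and digits only) is an assumption;
a real parser would need more care.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_reverse_index(docs):
    """Map each word to the set of document names that contain it.

    docs is a dict of {document name: document text}.
    """
    index = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

# Illustrative data only.
docs = {
    "a.txt": "The quick brown fox",
    "b.txt": "The lazy dog",
}
index = build_reverse_index(docs)
```

Looking up index["fox"] then gives you the set of documents that
contain the word, without scanning any document text.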

For speed and space reasons you use a numeric doc_id value
instead of the doc [name] and keep a separate doc_id -> doc relation.
You can do the same with the words, if it makes sense.
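
A small sketch of such an id table, in Python. IdTable is a
hypothetical helper, not something taken from an existing engine; it
just hands out consecutive integers and can map them back. The same
class would work for word ids.

```python
class IdTable:
    """Assign consecutive integer ids to strings and map them back."""

    def __init__(self):
        self._ids = {}     # name -> id
        self._names = []   # id -> name (the list index is the id)

    def get_id(self, name):
        """Return the id for name, assigning a new one on first use."""
        if name not in self._ids:
            self._ids[name] = len(self._names)
            self._names.append(name)
        return self._ids[name]

    def get_name(self, id_):
        """Return the name that was registered under id_."""
        return self._names[id_]

# Illustrative data only.
doc_ids = IdTable()
first = doc_ids.get_id("manual.pdf")
second = doc_ids.get_id("report.pdf")
again = doc_ids.get_id("manual.pdf")
```

The reverse index then stores small integers instead of repeating the
file name for every word.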

You can use boolean AND/OR to search for several words. Phrase search
is more sophisticated: it requires storing word positions, too, but
it improves relevance a lot.
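
Roughly, the three kinds of search could look like this in Python. It
is only a sketch: the function names are invented, words are split on
whitespace, and the index is kept as word -> {doc_id: [positions]} in
memory.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map word -> {doc_id: [positions]}; docs is {doc_id: text}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def search_and(index, words):
    """Doc ids that contain every word."""
    sets = [set(index.get(w, {})) for w in words]
    return set.intersection(*sets) if sets else set()

def search_or(index, words):
    """Doc ids that contain at least one word."""
    result = set()
    for w in words:
        result |= set(index.get(w, {}))
    return result

def search_phrase(index, words):
    """Doc ids where the words occur at consecutive positions."""
    hits = set()
    for doc_id in search_and(index, words):
        # Try each occurrence of the first word as the phrase start.
        for start in index[words[0]][doc_id]:
            if all(start + i in index[w][doc_id]
                   for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits

# Illustrative data only.
docs = {
    1: "full text search engine",
    2: "text search is full",
    3: "text engine",
}
index = build_positional_index(docs)
```

Here search_and(index, ["full", "text"]) matches documents 1 and 2,
while search_phrase(index, ["full", "text"]) matches only document 1,
which is why phrase search helps relevance.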

This way you can build a very simple (basic) search engine.
If you use an SQL database, development will be much easier:
you only have to write a doc parser and some SQL queries.
But this will slow down the search and close the door
to scalability (I mean you will not be able to deal with many
documents).
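
For illustration, the SQL variant might look like the sketch below,
using Python's sqlite3 module with an in-memory database. The table
and column names (doc, posting, pos) and the functions add_document
and search_all are my own assumptions, not the schema of any existing
engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE doc (doc_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE posting (word TEXT, doc_id INTEGER, pos INTEGER);
CREATE INDEX posting_word ON posting (word);
""")

def add_document(conn, name, text):
    """The 'doc parser': store one row per word occurrence."""
    cur = conn.execute("INSERT INTO doc (name) VALUES (?)", (name,))
    doc_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO posting (word, doc_id, pos) VALUES (?, ?, ?)",
        [(w, doc_id, pos) for pos, w in enumerate(text.lower().split())],
    )

def search_all(conn, words):
    """Names of documents that contain every word (boolean AND)."""
    words = list(set(words))
    if not words:
        return []
    marks = ",".join("?" * len(words))
    rows = conn.execute(
        "SELECT d.name FROM posting p JOIN doc d ON d.doc_id = p.doc_id "
        f"WHERE p.word IN ({marks}) "
        "GROUP BY d.doc_id HAVING COUNT(DISTINCT p.word) = ? "
        "ORDER BY d.name",
        (*words, len(words)),
    ).fetchall()
    return [row[0] for row in rows]

# Illustrative data only.
add_document(conn, "a.pdf", "full text search")
add_document(conn, "b.pdf", "text only")
```

Note that the posting table gets one row per word occurrence, so it
grows quickly with the size of the collection.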

This is basically how udmsearch-1.7 (the first public release of
UdmSearch) was written by Alex Barkov.

Thomas -Balu- Walter wrote:
> 
> Hia,
> 
> I am about to write a little search-engine that allows me to
> fulltext-search a large collection of PDFs. Converting the PDFs to ASCII
> shouldn't be a problem, but I wonder how to store the indexed data.
> 
> IMHO just storing the fulltext data in a database and search the fields
> is very ineffective.
> 
> Right now I am thinking of splitting each file into words and just
> store the data in a table like
> 
> id | word | files containing word
> 
> Where the 'files containing word' would be a somewhat splitted strings
> of id's that point to
> 
> id | filename
> 
> So far for my first idea... Improvements? ;)
> 
> I am asking here, because I know, use and like aspseek very much - and
> right now you are my first idea on where to ask 8).
> 
> If this discussion is more related to aseek-devel, please move the
> discussion (and notify me, as I am not subscribed to that list)
> 
>      Balu
> PS: Regarding to the actual "Debian Weekly News" I have to say the
> following words:
>         Thank you, your work is used every day and I love it 8)

-- 
[EMAIL PROTECTED]
XMMS: %s
