Hia,

I am about to write a little search-engine that allows me to
fulltext-search a large collection of PDFs. Converting the PDFs to ASCII
shouldn't be a problem, but I wonder how to store the indexed data.

IMHO just storing the fulltext data in a database and searching the
fields is very inefficient.

Right now I am thinking of splitting each file into words and just
storing the data in a table like

id | word | files containing word

Where 'files containing word' would be a delimited string of ids
that point to

id | filename
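
To make the idea concrete, here is a rough Python sketch of what I mean.
The names build_index and search, the comma as id separator and the
sample filenames are only assumptions for illustration, and it keeps
both tables in memory instead of in a real database.

```python
import re
from collections import defaultdict


def build_index(texts):
    """Build both tables from a dict of filename -> extracted plain text.

    `texts` is a hypothetical input; the PDF-to-text conversion is assumed
    to have happened already.
    """
    file_ids = {}                # filename -> id   (the "id | filename" table)
    postings = defaultdict(set)  # word -> set of file ids

    for filename, text in texts.items():
        # Assign ids 1, 2, 3, ... in the order the files are seen.
        file_id = file_ids.setdefault(filename, len(file_ids) + 1)
        # Lowercase and split on anything that is not a letter or digit.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            postings[word].add(file_id)

    # Flatten each id set into a comma-separated string, as in the
    # 'files containing word' column (the separator is an assumption).
    word_table = {
        word: ",".join(str(i) for i in sorted(ids))
        for word, ids in postings.items()
    }
    return file_ids, word_table


def search(word, file_ids, word_table):
    """Return the filenames whose text contains `word` (case-insensitive)."""
    id_to_name = {file_id: name for name, file_id in file_ids.items()}
    id_string = word_table.get(word.lower(), "")
    return [id_to_name[int(i)] for i in id_string.split(",") if i]


if __name__ == "__main__":
    texts = {
        "a.pdf": "Debian news search",
        "b.pdf": "Search engine",
    }
    file_ids, word_table = build_index(texts)
    print(word_table["search"])
    print(search("engine", file_ids, word_table))
```

A lookup is then one read of the word row plus splitting the id string,
instead of scanning the full text of every file.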

So much for my first idea... Improvements? ;)

I am asking here because I know, use and like aspseek very much, and
right now this list is the first place I thought of asking 8).

If this discussion is better suited to aseek-devel, please move it
there (and notify me, as I am not subscribed to that list).


     Balu
PS: Regarding the current "Debian Weekly News", I have to say the
following:
        Thank you, your work is used every day and I love it 8)
