Hi,
I am about to write a little search engine that lets me
fulltext-search a large collection of PDFs. Converting the PDFs to ASCII
shouldn't be a problem, but I wonder how to store the indexed data.
IMHO, just storing the full text in a database and searching those
fields would be very inefficient.
Right now I am thinking of splitting each file into words and
storing the data in a table like
id | word | files containing word
where 'files containing word' would be a delimited string of IDs
pointing into a second table:
id | filename
That's my first idea so far... Improvements? ;)
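To make the idea concrete, here is a minimal sketch of that two-table layout using an in-memory SQLite database. All table and column names (files, words, file_ids) are made up for illustration; the word splitting is a naive whitespace split, not real tokenization:

```python
import sqlite3

# Hypothetical schema matching the proposed layout:
# words(id, word, file_ids)  -- file_ids is a comma-delimited ID string
# files(id, filename)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, filename TEXT)")
cur.execute("CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT, file_ids TEXT)")

# Two tiny stand-in "documents" (in practice: text extracted from the PDFs).
docs = {1: "search engine index", 2: "search pdf"}
cur.executemany("INSERT INTO files VALUES (?, ?)",
                [(i, f"doc{i}.pdf") for i in docs])

# Build the inverted index: word -> list of file IDs containing it.
index = {}
for file_id, text in docs.items():
    for word in set(text.split()):
        index.setdefault(word, []).append(file_id)

for word_id, (word, ids) in enumerate(sorted(index.items()), start=1):
    cur.execute("INSERT INTO words VALUES (?, ?, ?)",
                (word_id, word, ",".join(map(str, ids))))

# Query: which files contain the word "search"?
row = cur.execute("SELECT file_ids FROM words WHERE word = ?",
                  ("search",)).fetchone()
hits = [cur.execute("SELECT filename FROM files WHERE id = ?",
                    (fid,)).fetchone()[0]
        for fid in row[0].split(",")]
print(hits)
```

Note the trade-off this sketch makes visible: because file_ids is a string, the database can't join on it, so each lookup needs an extra query per ID on the application side.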
I am asking here because I know, use, and like aspseek very much - and
right now this list was my first idea of where to ask 8).
If this discussion belongs more on aseek-devel, please move it there
(and notify me, as I am not subscribed to that list).
Balu
PS: Regarding the current "Debian Weekly News", I have to say the
following: thank you, your work is used every day and I love it 8)