Arnon Klein wrote:
I think you should take a look at a data structure called suffix-tree. It is relatively easy to implement, and it has excellent complexity: linear in both creation and lookup time, and also linear in space.
No no no no! I was completely misunderstood.
There's no problem with searching a single word in the text and in fact I thought of and implemented this suffix tree while as an aide to the exercises in the University. BTW implementing this is IMHO silly, using bsddb is jiust as efficient and much more quick and convinient, there's no point with implementinng new such structures where good ones exists.
The problem is finding a set of words ["moshe","shlomo","reuven"] in a range of 10, so that I'll be able to find "...moshe has eaten shlomo. This is very sad, as reuven..." in the text.
Finding and indexing a single word is NOT the problem.
http://www.dogma.net/markn/articles/suffixt/suffixt.htm
Arnon
Yosef Leibovich wrote:
I'd like to implement FSF[this is the relation for this list] version of google-like search, that is given a large text file, a words list and an integer k, I'd like to find all integers x, so that in the range [x-k:x+k] in the text file all words occurs.
This is for immitating search programs behaviour, commercial examples I've found on the web are Takdin from http://www.takdinet.co.il or BIU responsa project on http://www.biu.ac.il/JH/Responsa/.
The general paradigm I thought of is inserting into a huge table (berkeley database can be used for instance) with one column for every word appearing in the text, and the other column for all the occurences of this word in the text. Searching a word list will be done by choosing the word with the smallest amount of occurences, and checkin for each occurence of this word in the text whether the other words in the list occur also in the required field.
Possible optimizations are adding, for words with many occurences, we'll add to the database their neighbours as well. And deviding a large files to chunks of data, storing in the database only at which chunk does the word appear, and not all occurences in the chunk - to save index space.
Any idea/experience with doing such things?
================================================================= To unsubscribe, send mail to [EMAIL PROTECTED] with the word "unsubscribe" in the message body, e.g., run the command echo unsubscribe | mail [EMAIL PROTECTED]
================================================================= To unsubscribe, send mail to [EMAIL PROTECTED] with the word "unsubscribe" in the message body, e.g., run the command echo unsubscribe | mail [EMAIL PROTECTED]
