Tzafrir Cohen wrote:
On Wed, Aug 04, 2004 at 01:42:11PM +0200, Yosef Leibovich wrote:

No, sorry. The relation to Google is that you write "tzafrir capable" and it finds "...many say of tzafrir that he's very capable..."
I'd like to implement an FSF [free software -- that is the relation to this list] version of a Google-like search: given a large text file, a list of words, and an integer k, I'd like to find all positions x such that every word in the list occurs in the range [x-k, x+k] of the text file.
And the relation to Google is?
Something to do with the ranking algorithm?
No hard requirement; a reasonable search time is the only one. Either O(1) in the common case but O(n^2) for words with many occurrences, or O(log n) for everything, would do. I need it to search ~0.5 GB of text usably; the user won't care about the asymptotic analysis of the algorithm.
This is for imitating the behaviour of search programs; commercial examples I've found on the web are Takdin at http://www.takdinet.co.il and the BIU Responsa Project at http://www.biu.ac.il/JH/Responsa/.
The general approach I thought of is building a huge table (Berkeley DB could be used, for instance) with one column for every word appearing in the text and another column for all the occurrences of that word. Searching for a word list is then done by choosing the word with the smallest number of occurrences and, for each occurrence of that word, checking whether the other words in the list also occur within the required range.
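A minimal sketch of that approach in Python -- an in-memory dict standing in for the Berkeley DB table, with the rarest word driving the search. All names here are mine, not from the original post:

```python
from collections import defaultdict

def build_index(text):
    """Map each word to the sorted list of its token positions in the text."""
    index = defaultdict(list)
    for pos, word in enumerate(text.split()):
        index[word].append(pos)
    return index

def proximity_search(index, words, k):
    """Return all positions x of the rarest word such that every word in
    `words` occurs somewhere in the token range [x - k, x + k]."""
    # Pick the word with the fewest occurrences and only scan those.
    rarest = min(words, key=lambda w: len(index.get(w, [])))
    hits = []
    for x in index.get(rarest, []):
        if all(any(x - k <= p <= x + k for p in index.get(w, []))
               for w in words):
            hits.append(x)
    return hits
```

The per-query cost is roughly (occurrences of the rarest word) x (occurrences of the other words), which is exactly the degenerate O(n^2) case mentioned above when every search term is common.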
You don't say anything about the complexity requirements of particular operations.
A trivial implementation would be a simple hash of all the words in the text, with a time complexity of about O(number of characters in the text).

That is not much of a solution. Suppose you store a list of all occurrences of every word in the text, and someone searches for "if" and "when", each with 100,000 occurrences -- what do you do then? My solution says: take the word with the minimal number of occurrences, and check whether the second word occurs near any occurrence of the first.
What bothers me the most is the size of the index, which would be about as large as the original text file.
Possible optimizations: for words with many occurrences, also store their neighbours in the database; and divide a large file into chunks, storing in the database only which chunks a word appears in rather than every occurrence within a chunk, to save index space.
Any idea/experience with doing such things?
also search for "sparse matrix" and similar stuff. Maybe this sort of
data structure would help.
================================================================= To unsubscribe, send mail to [EMAIL PROTECTED] with the word "unsubscribe" in the message body, e.g., run the command echo unsubscribe | mail [EMAIL PROTECTED]
