Re: Similar Document Search

Magnus Johansson Tue, 19 Aug 2003 00:34:58 -0700

Hi Peter,

I guess you are right.

I've implemented this for an index of ten million really small documents, all of which are stored in the index. The documents are never more than a thousand words, so re-indexing is quick enough. However, it is probably not advisable to do this with bigger documents or documents that need additional parsing.

/magnus

Peter Becker wrote:

Hi Magnus,

thanks for the offer, but unfortunately I can't/don't want to make the assumption that I can easily access the documents to re-index them. And I don't think this approach would be feasible unless you can keep the documents in memory somehow.

Storing the other/non-inverted/normal/whatever index would be expensive for indexing, but querying should be a lot faster than having to re-index documents. In our situation that is preferable.
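To make the trade-off concrete, here is a rough sketch of keeping a forward (document-to-terms) map alongside the inverted index. It is plain Python, not Lucene code. The class name `TwoWayIndex`, its methods, and the "pick the rarest terms" heuristic are all assumptions made for illustration. The forward map costs extra work and space at indexing time, but a similarity query can then read a document's terms directly instead of re-analyzing the text.

```python
from collections import Counter, defaultdict


class TwoWayIndex:
    """Hypothetical toy index holding both an inverted and a forward map."""

    def __init__(self):
        self.postings = defaultdict(set)  # inverted: term -> set of doc ids
        self.doc_terms = {}               # forward: doc id -> term counts

    def add(self, doc_id, text):
        # Whitespace tokenization stands in for a real analyzer.
        counts = Counter(text.lower().split())
        self.doc_terms[doc_id] = counts   # the extra cost paid at indexing time
        for term in counts:
            self.postings[term].add(doc_id)

    def similar(self, doc_id, max_terms=3):
        # No re-analysis: the terms come straight from the forward map.
        counts = self.doc_terms[doc_id]
        # Assumed heuristic: prefer terms that occur in the fewest documents.
        terms = sorted(
            counts, key=lambda t: (len(self.postings[t]), -counts[t], t)
        )[:max_terms]
        hits = Counter()
        for term in terms:
            for other in self.postings[term]:
                if other != doc_id:
                    hits[other] += 1
        # Documents sharing the most selected terms come first.
        return [d for d, _ in sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))]


idx = TwoWayIndex()
idx.add(1, "lucene index search")
idx.add(2, "lucene index storage")
idx.add(3, "cooking recipes search")
print(idx.similar(1))
```

In this toy data, document 2 shares two terms with document 1 and document 3 shares one, so document 2 is ranked first.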

Peter

Magnus Johansson wrote:

Hi Peter

If the original document is available, you could extract keywords from the document at query time. That is, when someone asks for documents similar to document a, you re-analyze document a and, in combination with statistics from the Lucene index, extract keywords from it that can then be used as a query for finding similar documents.
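The query-time idea described above could be sketched roughly as follows. This is an illustration in plain Python, not the sample code mentioned in this message and not Lucene API. The function name `extract_keywords`, the TF-IDF weighting, and the `doc_freq` dictionary (a stand-in for document frequencies read from the index) are all assumptions.

```python
import math
import re
from collections import Counter


def extract_keywords(text, doc_freq, num_docs, max_terms=5):
    """Return the terms of `text` with the highest TF-IDF score.

    doc_freq: term -> number of indexed documents containing the term
              (stand-in for statistics taken from the index).
    num_docs: total number of documents in the index.
    """
    # Stand-in for re-analyzing the document: lowercase, split on non-alphanumerics.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    term_freq = Counter(tokens)

    scores = {}
    for term, freq in term_freq.items():
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # term is not in the index, so it cannot match anything
        # Frequent in this document, rare in the index -> high score.
        scores[term] = freq * math.log(num_docs / df)

    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:max_terms]


doc_freq = {"the": 1000, "lucene": 10, "index": 50, "search": 200}
text = "The Lucene index: the index stores the terms for Lucene search"
keywords = extract_keywords(text, doc_freq, num_docs=1000, max_terms=3)
print(" OR ".join(keywords))
```

The selected terms would then be combined into an ordinary query against the index, for example by OR-ing them together as the last line suggests.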

I've got some sample code if anyone is interested.

/magnus

