Re: Document-Document similarity

Steve Rowe Tue, 07 Oct 2003 11:59:31 -0700

Maurice,

Why not perform document-as-query? That is, parse a document to produce a query, submit the query, and get a list of documents ranked by similarity.

Are you trying to do clustering? Write a custom analyzer which saves the analysis of each document as it's parsed for the indexing process, then iterate through all of the documents, submit each as a query, and collect the results.
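A rough sketch of the document-as-query loop described above, in plain Python rather than against the Lucene API. The names `analyze`, `build_index`, `idf` and `document_as_query` are hypothetical, the whitespace analyzer is only a stand-in for a real one, and the tf-idf weighting is an assumed scoring scheme, not Lucene's actual formula.

```python
import math
from collections import Counter

def analyze(text):
    """Stand-in analyzer (assumption): lowercase and split on whitespace."""
    return text.lower().split()

def build_index(docs):
    """Save each document's analysis once, at 'index' time."""
    return {doc_id: Counter(analyze(text)) for doc_id, text in docs.items()}

def idf(index):
    """Inverse document frequency for every term in the collection."""
    n = len(index)
    df = Counter(term for counts in index.values() for term in counts)
    return {term: math.log(1 + n / d) for term, d in df.items()}

def document_as_query(doc_id, index, weights):
    """Use one document's saved terms as the query and rank the others."""
    query = index[doc_id]
    scores = {}
    for other_id, counts in index.items():
        if other_id == doc_id:
            continue
        score = sum(query[t] * counts[t] * weights[t] for t in query if t in counts)
        if score > 0:
            scores[other_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {
    "a": "lucene index search query",
    "b": "lucene query parser search",
    "c": "cooking pasta with tomato sauce",
}
index = build_index(docs)
weights = idf(index)

# Submit every document as a query and collect the ranked results.
neighbours = {doc_id: document_as_query(doc_id, index, weights) for doc_id in index}
```

The resulting `neighbours` map could feed a clustering step. The scores here are not length-normalized, so long documents would tend to rank higher than they should.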

Or pseudo-relevance feedback? Re-parse the top N documents returned by a given query, bundle the results up as another query, then weight each component and recombine the scores (Rocchio's formula). The full formula also has a negatively reinforcing component: re-parse the bottom M documents returned by the initial query, package them as another query, and give that query a negative weight when combining it with the other components' scores. That step, though, doesn't seem to improve the overall outcome reliably.
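For illustration, here is a sketch of Rocchio's formula in its usual vector form, q' = alpha*q + beta*mean(top N) - gamma*mean(bottom M), applied to term-weight dictionaries instead of combining the scores of separate queries. The function names and the default weights are assumptions. `gamma` defaults to 0, which leaves out the negative component.

```python
from collections import Counter

def centroid(vectors):
    """Average a list of term-weight vectors (dicts); empty input gives {}."""
    total = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {t: w / len(vectors) for t, w in total.items()} if vectors else {}

def rocchio(query, top_docs, bottom_docs=(), alpha=1.0, beta=0.75, gamma=0.0):
    """Expand a query vector with feedback from top and bottom documents.

    alpha, beta and gamma are illustrative defaults, not tuned values.
    """
    pos = centroid(list(top_docs))
    neg = centroid(list(bottom_docs))
    expanded = {}
    for term in set(query) | set(pos) | set(neg):
        weight = (alpha * query.get(term, 0.0)
                  + beta * pos.get(term, 0.0)
                  - gamma * neg.get(term, 0.0))
        if weight > 0:  # drop terms whose combined weight is not positive
            expanded[term] = weight
    return expanded

query = {"lucene": 1.0}
top = [{"lucene": 1.0, "index": 2.0}, {"lucene": 1.0, "search": 2.0}]
expanded = rocchio(query, top)
with_negative = rocchio(query, top, [{"index": 4.0}], gamma=0.25)
```

The expanded vector would then be submitted as the second-pass query, with each term boosted by its weight.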

Steve Rowe

Maurice Coyle wrote:

does anyone know of a way to get the similarity between two documents as
opposed to between a document and a query?  at the moment, i'm forced to
make a term-frequency vector for each document and get the cosine of the
angle between them, but i was hoping there was a more elegant way of doing
this using either the lucene api (although from my study of it it doesn't
look like this is the case) or some other class library that another lucene
user has created.
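For reference, the term-frequency-vector approach described above might look like the following minimal sketch. It is plain Python, not the Lucene API; `term_frequencies` and `cosine_similarity` are hypothetical names, and whitespace tokenization is an assumption standing in for a real analyzer.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Build a raw term-frequency vector (assumed whitespace tokenization)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

doc1 = term_frequencies("lucene index search")
doc2 = term_frequencies("lucene search query")
doc3 = term_frequencies("pasta tomato sauce")
similarity = cosine_similarity(doc1, doc2)
```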

any help much appreciated.

maurice
