Why not perform document-as-query? That is, parse a document to produce a query, submit the query, and get a list of documents ranked by similarity.
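To make the document-as-query idea concrete, here is a minimal sketch in plain Python rather than Lucene. It assumes a naive tokenizer and raw term-frequency vectors compared by cosine, so it does not reproduce Lucene's own scoring; the names term_frequencies, cosine, and rank_by_document are made up for illustration.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text, split on non-alphanumerics, and count the terms."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_document(query_doc, corpus):
    """Treat query_doc as the query; return (doc_id, score) pairs, best first."""
    query_vector = term_frequencies(query_doc)
    scored = [
        (doc_id, cosine(query_vector, term_frequencies(text)))
        for doc_id, text in corpus.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling rank_by_document(some_text, {"id1": "...", "id2": "..."}) returns every document in the corpus with its similarity to some_text. With a real index you would hand the generated query to the search engine instead of scoring every document by hand.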
Are you trying to do clustering? Write a custom analyzer which saves the analysis of each document as it's parsed for the indexing process, then iterate through all of the documents, submit each as a query, and collect the results.
Or pseudo-relevance feedback? Re-parse the top N documents returned by a given query, bundle up the results as another query, then weight the components and recombine their scores (Rocchio's formula). The full formula also has a negatively reinforcing component: re-parse the bottom M documents returned by the initial query, package them as another query, and give it a negative weight when combining it with the other components' scores. This step doesn't seem to contribute positively in a reliable fashion to the overall outcome, though.
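A rough sketch of the Rocchio combination, in plain Python over term-weight dictionaries rather than Lucene queries. The function name, the default weights, and the choice to drop terms whose combined weight is not positive are illustrative assumptions, not anything prescribed above. Here gamma defaults to 0, which leaves out the negative component.

```python
from collections import Counter


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.0):
    """Combine term-weight dicts into an expanded query (term -> weight).

    query        -- weights of the original query terms
    relevant     -- term-weight dicts for the top N documents
    non_relevant -- term-weight dicts for the bottom M documents
    The alpha/beta/gamma defaults are illustrative assumptions.
    """
    expanded = Counter()
    for term, weight in query.items():
        expanded[term] += alpha * weight
    # Add the average of the top-N document vectors, scaled by beta.
    for doc in relevant:
        for term, weight in doc.items():
            expanded[term] += beta * weight / len(relevant)
    # Subtract the average of the bottom-M document vectors, scaled by gamma.
    for doc in non_relevant:
        for term, weight in doc.items():
            expanded[term] -= gamma * weight / len(non_relevant)
    # Assumption: terms that end up with zero or negative weight are dropped.
    return {term: weight for term, weight in expanded.items() if weight > 0}
```

The returned weights could then be used as per-term boosts when building the follow-up query.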
Steve Rowe
Maurice Coyle wrote:
does anyone know of a way to get the similarity between two documents, as opposed to between a document and a query? at the moment, i'm forced to make a term-frequency vector for each document and get the cosine of the angle between them, but i was hoping there was a more elegant way of doing this using either the lucene api (although from my study of it, it doesn't look like this is the case) or some other class library that another lucene user has created.
any help much appreciated.
maurice
