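Have you tried building a BooleanQuery with one optional (Occur.SHOULD) clause for every term in the query document? Because none of the clauses is required, documents that contain only some of the terms still match. You will get a lot of matches, ranked from highest to lowest score according to the similarity you set.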
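In case it helps, here is a rough sketch of that suggestion, not a definitive implementation. The class name SimilarDocumentQuery, the build method and the field name you pass in are made up for illustration; the Lucene calls themselves (Analyzer.tokenStream, BooleanQuery.Builder, TermQuery, BooleanClause.Occur.SHOULD) are the standard ones. It assumes you analyze the query text with the same Analyzer you used at index time. Note that it deduplicates terms, so a word repeated in the query document counts only once.

```java
import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Hypothetical helper: turns the text of a "query document" into a
// BooleanQuery in which every distinct term is an optional clause.
public final class SimilarDocumentQuery {

    private SimilarDocumentQuery() {}

    public static Query build(Analyzer analyzer, String field, String text)
            throws IOException {
        // Collect the distinct analyzed terms, keeping first-seen order.
        Set<String> terms = new LinkedHashSet<>();
        try (TokenStream stream = analyzer.tokenStream(field, text)) {
            CharTermAttribute termAttr =
                    stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                terms.add(termAttr.toString());
            }
            stream.end();
        }

        // One SHOULD clause per term: a document matches if it contains
        // at least one of them, and matching more terms raises its score.
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (String term : terms) {
            builder.add(new TermQuery(new Term(field, term)),
                    BooleanClause.Occur.SHOULD);
        }
        return builder.build();
    }
}
```

You would then call searcher.setSimilarity(new BM25Similarity()) (or ClassicSimilarity) on your IndexSearcher and run searcher.search(query, n) to get the top n hits in descending score order. With 10 or so phrases of 5-10 terms each you should stay well below the default limit of 1024 clauses per BooleanQuery.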

On Thu, Dec 12, 2019 at 10:47 AM John Brown <brown.j...@temple.edu> wrote:
>
> Hi,
>
> I have some questions about how to use Lucene for the specific purpose
> of finding document similarities. Lucene seems to have classes that were
> made for this, including: ClassicSimilarity and BM25Similarity. However
> I’m fumbling a bit when it comes to implementing them.
>
> From what I understand, to use these classes you simply set the
> similarity of your IndexWriter and IndexSearcher, then submit a query.
> The documents returned from your query should be ordered from highest to
> lowest similarity.
>
> My initial thought was to just use a phrase query to hold the "document"
> I want to find similarities to, but phrase queries are limited in that
> they will only return results that are deemed to fall within a certain
> slop value. Term/Boolean queries are similarly limited in that they
> allow documents to be sorted only if they contain all the terms in the
> query.
>
> Ideally, I’d like to submit a query that would essentially be a document
> itself. Each of my queries contain 10 or so phrases, that each contain
> 5-10 terms. I would like to compare this query with all the documents in
> my index to see which is the most similar, and which is the least
> similar. I feel as if there is an easy way to do this that I'm missing,
> after all, I essentially just want to remove a step from the process.
> Any help would be much appreciated.
>
> Thank you,
>
> -John B