I agree with Ivan and Koji. You might also want to look into MoreLikeThis, which should take care of finding the highest tf*idf terms for you to use in your query: http://lucene.apache.org/core/4_4_0/queries/org/apache/lucene/queries/mlt/MoreLikeThis.html
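For intuition, here is a rough plain-Java sketch of what "finding the highest tf*idf terms" means. It is not MoreLikeThis and does not use Lucene; the class name `TfIdfTermsSketch`, the in-memory corpus, the tokenizer and the unsmoothed idf formula are all assumptions made for illustration.

```java
import java.util.*;

// Hypothetical illustration only: ranks the terms of a long query text by
// tf * idf against a tiny in-memory corpus. Not Lucene's implementation.
class TfIdfTermsSketch {

    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    static List<String> topTerms(String queryText, List<String> corpus, int maxTerms) {
        // Document frequency: in how many corpus documents each term occurs.
        Map<String, Integer> df = new HashMap<>();
        for (String doc : corpus) {
            for (String t : new HashSet<>(tokenize(doc))) df.merge(t, 1, Integer::sum);
        }
        // Term frequency inside the query text itself.
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokenize(queryText)) tf.merge(t, 1, Integer::sum);

        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int d = df.getOrDefault(e.getKey(), 0);
            if (d == 0) continue; // a term absent from the corpus cannot match anything
            // Assumed idf variant: plain log(N / df), no smoothing.
            score.put(e.getKey(), e.getValue() * Math.log((double) corpus.size() / d));
        }
        List<String> terms = new ArrayList<>(score.keySet());
        // Highest score first; alphabetical order breaks ties deterministically.
        terms.sort((a, b) -> {
            int c = Double.compare(score.get(b), score.get(a));
            return c != 0 ? c : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(maxTerms, terms.size())));
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "soundcloud is an audio platform",
                "the track is an audio file",
                "the platform is online");
        System.out.println(topTerms(
                "Download the track now in the SoundCloud website", corpus, 2));
    }
}
```

The selected terms would then be used as the query in place of the full text, so that frequent words such as "the" contribute little or nothing.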
Best,
Tim

________________________________________
From: Ivan Krišto [ivan.kri...@gmail.com]
Sent: Wednesday, September 04, 2013 3:17 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene Text Similarity

On 09/03/2013 07:33 PM, David Miranda wrote:
> Is there any way to check the similarity of texts with Lucene?
>
> I have DBpedia indexed and want to find the DBpedia abstracts most
> similar to another text. If I search the abstract field with a
> particular text, the results are not very satisfactory.
>
> E.g.:
>
> DBpedia abstract: "SoundCloud is an online audio distribution platform
> which allows collaboration, promotion and distribution of audio
> recordings."
>
> My text: "Private Track From DJ Sneak. Download the track now in the
> SoundCloud website."

You are attacking an extremely hard problem here: searching short documents with a long query. This creates a lot of problems, such as a term's document frequency ending up at the same magnitude as its own frequency, which instantly kills some similarity measures.

All you can do is experiment a lot with different similarity measures and preprocessing steps. Similarity measures are simple: just try them all for each preprocessing combination.

Suggestions for preprocessing steps:
- remove all stop words
- remove all function words (you can find a list of them on Wikipedia)
- boost all uppercase words or words containing at least one uppercase letter (add a boost of 3 or 4; maybe skip the first word of a sentence)
- break the search text into sentences, then search the index for each sentence (combine the results using Borda count or something similar)
- do what Koji suggested

Regards,
Ivan Krišto
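
Two of the suggestions above, boosting words that contain an uppercase letter and merging per-sentence result lists with a Borda count, can be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, the `term^3` output assumes the classic Lucene query parser boost syntax, and the Borda variant shown (n points for first place, 0 for documents missing from a list) is one of several in use.

```java
import java.util.*;

// Hypothetical helper, not part of Lucene.
class QueryPrepSketch {

    // Drops stop words and appends ^boost to words containing an uppercase
    // letter, skipping the first word of the sentence.
    static String toBoostedQuery(String sentence, Set<String> stopWords, int boost) {
        StringBuilder sb = new StringBuilder();
        String[] words = sentence.trim().split("[^\\p{L}\\p{N}]+");
        for (int i = 0; i < words.length; i++) {
            String w = words[i];
            String lower = w.toLowerCase(Locale.ROOT);
            if (w.isEmpty() || stopWords.contains(lower)) continue;
            if (sb.length() > 0) sb.append(' ');
            sb.append(lower);
            if (!w.equals(lower) && i > 0) sb.append('^').append(boost);
        }
        return sb.toString();
    }

    // Each inner list is one ranked result list (best first, no duplicates),
    // e.g. the document IDs returned for one sentence of the search text.
    static List<String> bordaCombine(List<List<String>> rankings) {
        Set<String> all = new LinkedHashSet<>();
        for (List<String> r : rankings) all.addAll(r);
        int n = all.size();
        Map<String, Integer> points = new HashMap<>();
        for (List<String> r : rankings) {
            for (int i = 0; i < r.size(); i++) {
                points.merge(r.get(i), n - i, Integer::sum); // rank i earns n - i points
            }
        }
        List<String> out = new ArrayList<>(all);
        // List.sort is stable, so ties keep first-seen order.
        out.sort((a, b) -> Integer.compare(points.get(b), points.get(a)));
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "in", "now"));
        System.out.println(toBoostedQuery(
                "Download the track now in the SoundCloud website.", stop, 3));
        System.out.println(bordaCombine(Arrays.asList(
                Arrays.asList("A", "B", "C"),
                Arrays.asList("B", "A", "D"),
                Arrays.asList("B", "C"))));
    }
}
```

In a real setup each inner list would come from running one sentence's query against the index, and the boosted string would be handed to a query parser rather than printed.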