Thanks to all; I will take your suggestions into account. I think I should have given the concrete use case, though. Going back to my first example: I have the emails received by a user, and from each email I extract topics of interest to associate with DBpedia terms (basically DBpedia documents). The problem is that a term such as Apple may be the fruit or the company (Apple Computers). To resolve this ambiguity, I wanted to compare the DBpedia abstract of each candidate with the text of the email to find out which term is the best one to choose.
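A minimal sketch of the comparison described above, assuming a plain tf*idf cosine similarity between the email text and each candidate abstract. It uses only the Python standard library rather than Lucene, and the stop list, the two abstracts and the email text are made-up illustrations, not real DBpedia data.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; not Lucene's default list).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for", "that", "my"}


def tokenize(text):
    """Lowercase, keep runs of ASCII letters, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vector(tokens, doc_freq, n_docs):
    """Raw term frequency times a smoothed inverse document frequency."""
    counts = Counter(tokens)
    return {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0)
        for term, count in counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_candidate(email_text, candidates):
    """candidates maps a label to its abstract; returns (best label, all scores)."""
    tokenized = {label: tokenize(abstract) for label, abstract in candidates.items()}
    # Document frequencies are computed over the candidate abstracts only.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    n_docs = len(tokenized)
    query = tfidf_vector(tokenize(email_text), doc_freq, n_docs)
    scores = {
        label: cosine(query, tfidf_vector(tokens, doc_freq, n_docs))
        for label, tokens in tokenized.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical candidates for the ambiguous term "Apple".
candidates = {
    "Apple (fruit)": "The apple is a sweet edible fruit produced by the apple tree.",
    "Apple Inc.": "Apple is a technology company that designs computers, phones and software.",
}
email = "My new Apple laptop arrived; the software update made the computers in the office much faster."

label, scores = best_candidate(email, candidates)
print(label, scores)
```

With only a handful of candidates the document frequencies are very coarse, so in practice the idf statistics would more plausibly come from the whole DBpedia index.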
Thanks.

2013/9/4 Allison, Timothy B. <[email protected]>:
> I agree with Ivan and Koji. You also might want to look into MoreLikeThis,
> which should take care of finding the highest tf*idf terms for you to use in
> your query --
> http://lucene.apache.org/core/4_4_0/queries/org/apache/lucene/queries/mlt/MoreLikeThis.html
>
> Best,
>
> Tim
>
> ________________________________________
> From: Ivan Krišto [[email protected]]
> Sent: Wednesday, September 04, 2013 3:17 AM
> To: [email protected]
> Subject: Re: Lucene Text Similarity
>
> On 09/03/2013 07:33 PM, David Miranda wrote:
>
> Is there any way to check the similarity of texts with Lucene? I have
> DBpedia indexed and wanted to find the DBpedia abstracts most similar to
> another text. If I search the abstract field with a particular text, the
> result is not very satisfactory. E.g., DBpedia abstract: "SoundCloud is an
> online audio distribution platform which allows collaboration, promotion
> and distribution of audio recordings." My text: "Private Track From DJ
> Sneak. Download the track now in the SoundCloud website."
>
>
> You are attacking an extremely hard problem here -- searching short
> documents with a long query. This creates a lot of problems, such as
> setting the document frequency of a term to the same magnitude as its own
> frequency, which instantly kills some similarity measures.
>
> All you can do is experiment a lot with different similarity measures
> and preprocessing steps.
>
> Similarity measures are simple; just try them all for each preprocessing
> combination.
>
> Suggestions of preprocessing steps:
> - remove all stop words
> - remove all function words (you can find a list of them on Wikipedia)
> - boost all uppercase words or words containing at least one uppercase
>   letter (add a boost of 3 or 4; maybe skip the first word of a sentence)
> - break the search text into sentences, then search the index for each
>   sentence (combine the results using a Borda count or something similar)
> - do what Koji suggested
>
> Regards,
> Ivan Krišto
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
--
Regards,
David Miranda
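The suggestion in the quoted list to search sentence by sentence and merge the results with a Borda count could be sketched roughly as follows. This is only an illustration under stated assumptions: the sentence splitter is naive, the per-sentence rankings are hard-coded stand-ins for what an index search might return, and the scoring is one common Borda variant for lists of unequal length (the item at position i of a list of length n gets n - i points).

```python
import re
from collections import defaultdict


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def borda_combine(rankings):
    """Merge ranked lists of document ids into one list, best first.

    In a list of length n, the item at position i (0-based) earns n - i points.
    Ties are broken alphabetically so the output is deterministic.
    """
    points = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, doc_id in enumerate(ranking):
            points[doc_id] += n - position
    return sorted(points, key=lambda doc_id: (-points[doc_id], doc_id))


text = "Private Track From DJ Sneak. Download the track now in the SoundCloud website."
sentences = split_sentences(text)

# Hypothetical search results, one ranked list per sentence (best hit first).
rankings = [
    ["SoundCloud", "DJ_Sneak", "Bandcamp"],
    ["SoundCloud", "Bandcamp"],
]

combined = borda_combine(rankings)
print(sentences)
print(combined)
```

A document that ranks well for several sentences accumulates points from each list, which is what lets it rise above a document that matches only one sentence strongly.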
