Hi Hoss,

Thanks for the answer, and yes, you have described the problem perfectly. I think you are right that Lucene is in fact not the best way of solving it. I decided to simply build a letter trie of all concepts and then search the query document against that trie. On the one hand this yields exact matches only (and that's exactly what I need), and on the other it even yields matches for concepts that appear in plural form in the query document, so "von Willebrands" will still yield "von Willebrand".
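In outline, the lookup works something like the sketch below (much simplified, not the actual code: everything is lowercased, and word boundaries are ignored, which a real implementation would have to handle):

```java
import java.util.*;

// Minimal character trie: each node maps a character to a child node and
// optionally marks the end of a stored concept phrase.
class TrieNode {
    Map<Character, TrieNode> children = new HashMap<Character, TrieNode>();
    String concept; // non-null if a concept ends at this node
}

public class ConceptTrie {
    private final TrieNode root = new TrieNode();

    // Insert a concept phrase character by character.
    public void add(String concept) {
        TrieNode node = root;
        for (char c : concept.toLowerCase().toCharArray()) {
            TrieNode child = node.children.get(c);
            if (child == null) {
                child = new TrieNode();
                node.children.put(c, child);
            }
            node = child;
        }
        node.concept = concept;
    }

    // Scan the document: at every position, follow the trie as far as the
    // text allows and report the longest concept ending there. Because the
    // stored phrase only has to be a prefix of the text at that position,
    // "von Willebrands" in the document still matches the stored
    // "von Willebrand".
    public List<String> findConcepts(String document) {
        List<String> matches = new ArrayList<String>();
        String text = document.toLowerCase();
        for (int start = 0; start < text.length(); start++) {
            TrieNode node = root;
            String longest = null;
            for (int i = start; i < text.length(); i++) {
                node = node.children.get(text.charAt(i));
                if (node == null) break;
                if (node.concept != null) longest = node.concept;
            }
            if (longest != null) matches.add(longest);
        }
        return matches;
    }

    public static void main(String[] args) {
        ConceptTrie trie = new ConceptTrie();
        trie.add("von Willebrand");
        System.out.println(trie.findConcepts("Patient shows signs of von Willebrands disease."));
        // prints [von Willebrand]
    }
}
```

The plural behaviour falls out of the prefix walk for free; the price is that very short concepts could also match inside longer words, so checking word boundaries at the start and end of a match is worth adding.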
Thanks for your efforts,
Sven

--- Original Message ---
Date: 15.01.2006 22:14
From: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Re: AW: Part-Of Match

> : >>von Willebrand<< is not the query but a document in the index.... The task
> : is to detect exact matches of phrases inside a query (large document) with
> : these phrases stored in the index.
>
> Lemme see if I can restate your problem...
>
> You want to build a data repository in which you insert a large magnitude
> of "concepts", where a concept is a short phrase consisting of a few words
> (possibly just one word). The words in any given concept phrase may
> overlap with (or be a superset of) the words in other concepts.
>
> Once this concept repository is built, you want to build a black box
> around it, such that people can hand your black box a "document"
> (i.e. a research paper, a newspaper article, a short story, ...
> some text consisting of many many sentences) and you want your black box
> to then return the list of concepts that match the input document, such
> that the concepts with the highest score are concepts whose phrase appears
> exactly in the input document. Concepts whose phrase doesn't appear
> exactly in the document should still be returned, but with a lower score
> based on how many words in the concept's phrase are found in the input
> document.
>
> (Have I adequately described your problem?)
>
> It's an interesting idea. Can it be done with Lucene? ... I can think of
> one kludgy mechanism for doing it, but I'd be very surprised if there isn't
> a better way (or if there is some other software library out there that
> would be more suited).
>
> Build a permanent index in which each concept is a Lucene Document.
> These documents really only need one stored/tokenized/indexed field
> containing the phrase (if you want other payload fields, that's up to you).
>
> Each time you are asked to analyze a text sample and return matching
> phrases, run the text through your analyzer to get back a token stream, and
> for each of those tokens, use a TermDocs iterator to find out if any
> phrase in your concept index contains that term, and if so which ones.
> (You could also do this by building a boolean OR query out of all the
> words in your input document -- but that may run into performance
> limitations if your input docs are too big, and it will try to score each
> concept, which isn't necessary, so even for short input text it's less
> efficient.)
>
> Now you have an (unordered) list of concepts that have something to do
> with your input text.
>
> Next, build a RAMDirectory-based index consisting of exactly one document
> which you build from the input text. Loop over that list of concepts you
> got, and build a boolean query out of each one along the lines that
> Daniel described: a phrase query on the whole concept phrase along with
> term queries for each individual word -- all optional. Run each of these
> boolean queries against your one-document RAMDirectory. The higher the
> score, the better that concept applies to your input text.
>
> -Hoss
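For reference, the two-index scheme Hoss describes might look roughly like the following, written against the Lucene API of that era (1.9/2.x style). The class name ConceptMatcher and the field names "phrase" and "body" are illustrative, not from the thread, and the concept-phrase tokenization is simplified to whitespace splitting:

```java
import java.io.StringReader;
import java.util.*;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.*;
import org.apache.lucene.store.RAMDirectory;

public class ConceptMatcher {

    private final Analyzer analyzer = new StandardAnalyzer();

    // Permanent concept index: one Lucene Document per concept, with a
    // single stored/tokenized/indexed "phrase" field.
    void addConcept(IndexWriter conceptWriter, String phrase) throws Exception {
        Document doc = new Document();
        doc.add(new Field("phrase", phrase, Field.Store.YES, Field.Index.TOKENIZED));
        conceptWriter.addDocument(doc);
    }

    // Step 1: collect the ids of all concepts that share at least one term
    // with the input text, using a TermDocs iterator per input token.
    Set<Integer> candidateConcepts(IndexReader conceptReader, String text) throws Exception {
        Set<Integer> candidates = new HashSet<Integer>();
        TokenStream ts = analyzer.tokenStream("phrase", new StringReader(text));
        for (Token tok = ts.next(); tok != null; tok = ts.next()) {
            TermDocs td = conceptReader.termDocs(new Term("phrase", tok.termText()));
            while (td.next()) {
                candidates.add(Integer.valueOf(td.doc()));
            }
            td.close();
        }
        return candidates;
    }

    // Step 2: index the input text as a single document in a RAMDirectory ...
    RAMDirectory indexInputText(String text) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, analyzer, true);
        Document doc = new Document();
        doc.add(new Field("body", text, Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();
        return dir;
    }

    // Step 3: ... and score each candidate concept against it with a boolean
    // query: a phrase query on the whole concept plus a term query per word,
    // all optional, as described above.
    float scoreConcept(IndexSearcher searcher, String conceptPhrase) throws Exception {
        BooleanQuery bq = new BooleanQuery();
        PhraseQuery pq = new PhraseQuery();
        for (StringTokenizer st = new StringTokenizer(conceptPhrase.toLowerCase()); st.hasMoreTokens(); ) {
            String word = st.nextToken();
            pq.add(new Term("body", word));
            bq.add(new TermQuery(new Term("body", word)), BooleanClause.Occur.SHOULD);
        }
        bq.add(pq, BooleanClause.Occur.SHOULD);
        Hits hits = searcher.search(bq);
        return hits.length() > 0 ? hits.score(0) : 0.0f;
    }
}
```

The TermDocs prefilter in step 1 is what keeps this workable for large documents: only concepts that share at least one term with the input ever get a scoring query run against the one-document RAMDirectory.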