Have you tried similar-query?

  cts:similar-query(
    text { 'Mary had a little lamb whose fleece was white as snow' })

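For reference, here is a slightly fuller sketch of that call, wrapped in cts:search so the closest documents come back first. It uses the "Jane" variant of the phrase from the question. The options element, the [1 to 5] cutoff, and the <match> wrapper are illustrative assumptions rather than anything from the original messages, and phrase-related settings are left out; check the cts:similar-query and cts:distinctive-terms documentation for the options your server version supports.

```xquery
xquery version "1.0-ml";

(: Sketch only: assumes the candidate documents are already loaded. :)
let $phrase :=
  text { 'Jane had a little lamb whose fleece was white as snow' }

(: Options node, assumed to live in the cts:distinctive-terms namespace.
   max-terms is set to 16 only to make the default explicit. :)
let $options :=
  <options xmlns="cts:distinctive-terms">
    <max-terms>16</max-terms>
  </options>

(: Passing the empty sequence as the weight keeps the default weight. :)
let $query := cts:similar-query($phrase, (), $options)

(: cts:search returns results in relevance order, so the first few
   items are the closest matches according to the scoring. :)
for $doc in cts:search(fn:doc(), $query)[1 to 5]
return
  <match uri="{ xdmp:node-uri($doc) }" score="{ cts:score($doc) }"/>
```

Comparing the scores of the top few results gives a rough sense of whether the first hit is a real match or just the least-bad document.
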
You could set the max-terms option to the count of words in the phrase, or
leave it at the default 16 terms. I would leave it alone, at least at first.
It should select the 16 "best" terms, which means it will tend to drop
stopwords from the query if the text is long.

You could also control whether or not the similar-query will use phrase
search. I think it could be helpful, but you could try both ways.

One potential downside is that if there are no good matches, you will
probably still match on some stopwords.

-- Mike

On 25 May 2012, at 10:07 , seme...@hotmail.com wrote:

> Getting docs that have a match on a search phrase is easy (using case-,
> punctuation-, white-space-, insensitive options), and finding docs that have
> the highest frequency for the words in the search phrase is easy
> (cts:word-query and a sequence of terms), but I want to find docs that most
> closely match the search phrase.
>
> For example, if I have a doc that has this text in it: "Mary had a little
> lamb whose fleece was white as snow"
>
> If I search using "mary had a little lamb whose fleece was white as SNOW!!!"
> a cts:word-query would match if I sent the entire phrase and used all the
> "insensitive" options.
>
> If I search by tokenizing the phrase into ("mary", "had", "little", "lamb",
> "fleece", "white", "snow") I will get the doc that has the highest frequency
> of those words (and weighted according to doc size), which may or may not be
> my "Mary had a little lamb doc".
>
> And if I search for "Jane had a little lamb whose fleece was white as snow"
> the Mary doc won't match because the phrase doesn't match, and a tokenized
> words search probably won't match because some other doc with "Jane" and
> "snow" or whatever would be higher priority. I can try to use a near query of
> all the words, except "Jane" isn't in the doc, so there'd be no match for my
> Mary doc.
>
> What I want is the doc that has a phrase that most closely matches the search
> phrase, even if I drop, replace, or introduce an incorrect word. And I mean
> more than just spelled wrong.
>
> You can see that "Jane had a little lamb whose fleece was white as snow" is
> really close to "Mary had a little lamb whose fleece was white as snow" but I
> can't quite figure out how to get MarkLogic to determine that quickly since
> the phrase won't match and tokenized words won't necessarily give me the best
> relevance. I can get all the permutations of the phrase (every word with all
> the other words in all combinations) and OR them together but search
> performance suffers after just a few permutations.
>
> Anyone know how to do this?
>
> thanks,
> -Ryan
>
>
> _______________________________________________
> General mailing list
> General@developer.marklogic.com
> http://community.marklogic.com/mailman/listinfo/general