Re: Reordering search results

Erik Hatcher Mon, 03 Oct 2005 03:04:06 -0700


On Oct 3, 2005, at 4:56 AM, Chris Lamprecht wrote:

1- Words in Document that are closer to original search terms have
a larger Score. For example, if I was searching for "wellcome",
Document("wellcome") must be better than Document("welcome")


I'm just "thinking out loud" here, but some ideas that come to mind
are:  Index both the original text (with spelling errors) and the
spelling-corrected text.  When you search, require a match on the
corrected text, and in a non-required query clause also search the
uncorrected text, maybe boosted down a bit.  This way, if the spelling
was correct, it will match both the original term and the corrected
term (since they're the same), but a document with a misspelling would
match only the corrected term.  You'll have to experiment with boosts
and relevance/rankings here.
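
As a rough illustration of the two-clause idea above, here is a minimal sketch that only builds a QueryParser-style query string. The field names "corrected" and "original", the class name, and the 0.5 boost are assumptions made up for the example, not anything from the original setup; the term is assumed to contain no characters that need QueryParser escaping.

```java
// Sketch only: builds a QueryParser-style query string.
// The field names "corrected" and "original" are hypothetical.
final class DualFieldQuery {

    static String build(String typedTerm, String correctedTerm, float originalBoost) {
        // '+' makes the corrected-text clause required. The clause on the
        // original text is optional: it never excludes a document, it only
        // adds to the score of documents that contain the term as typed.
        return "+corrected:" + correctedTerm
                + " original:" + typedTerm + "^" + originalBoost;
    }

    public static void main(String[] args) {
        // The user typed "wellcome"; the spell checker suggests "welcome".
        System.out.println(build("wellcome", "welcome", 0.5f));
    }
}
```

With a query like this, a document whose original text contains "wellcome" should match both clauses, while a document containing only "welcome" matches just the required one. The boost value is a starting point for experimentation.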

Another idea is, if you know the number of misspellings made at
indexing time (it seems like you do), then boost documents based on
the number of spelling errors -- higher boost factor for fewer errors.
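
One possible shape for such a boost, purely as a sketch: the 1/(1+errors) formula and the class name are assumptions chosen for illustration, and the resulting value would presumably be applied as a document boost at indexing time.

```java
// Sketch only: maps a per-document spelling error count to a boost factor.
// The formula is a hypothetical choice; any decreasing function would do.
final class SpellingBoost {

    static float boostFor(int spellingErrors) {
        if (spellingErrors < 0) {
            throw new IllegalArgumentException("error count must be >= 0");
        }
        // 1.0 for an error-free document, shrinking toward 0 as errors grow.
        return 1.0f / (1 + spellingErrors);
    }

    public static void main(String[] args) {
        for (int errors = 0; errors <= 3; errors++) {
            System.out.println(errors + " errors -> boost " + boostFor(errors));
        }
    }
}
```

A gentler curve may work better in practice, since a steep penalty can swamp the rest of the relevance score.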

Another tip is that score is based on term frequency - so whentokenizing correct spellings, add multiple of the correct words toweight towards them.

2- Documents that have search terms close to each other, have a larger
Score. For example, if I was searching for "welcome there",
Document("welcome there") must be better than Document("welcome all
there"). Note that "all" is a stop word in my implementation.


PhraseQuery with a high slop factor (Integer.MAX_VALUE works) scores higher for
terms that are closer together.  You can construct the PhraseQuery
yourself (programmatically), or QueryParser takes it as:

"welcome there"~99999

(with the quotes).  99999 is the slop factor, which means to accept
documents where "welcome" is within 99999 positions of "there".
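
To make the proximity effect concrete, here is a small self-contained sketch for a two-term phrase. It does not call Lucene; the class and method names are made up, and the 1/(distance+1) weighting is an assumption modelled on my understanding of how the default Similarity treats sloppy matches.

```java
// Sketch only: a simplified two-term model of sloppy phrase matching.
final class SloppyPhraseSketch {

    // How far the second term is from where an exact phrase would put it.
    // Adjacent, in-order terms give 0; one intervening position gives 1.
    static int distance(int firstPos, int secondPos) {
        return Math.abs(secondPos - firstPos - 1);
    }

    // A match is accepted only if the distance fits within the slop.
    static boolean matches(int distance, int slop) {
        return distance <= slop;
    }

    // Assumed weighting: closer matches contribute more to the score.
    static float sloppyWeight(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        int slop = 99999;
        int adjacent = distance(0, 1); // terms next to each other
        int gapped = distance(0, 2);   // one position between them
        System.out.println(matches(adjacent, slop) + " " + sloppyWeight(adjacent));
        System.out.println(matches(gapped, slop) + " " + sloppyWeight(gapped));
    }
}
```

Both cases match with a large slop, but the adjacent pair gets the larger weight, which is the behaviour the high-slop PhraseQuery relies on.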

The issue is that "all" is a stop word, though.  The StopFilter does
not leave a hole when stop words are removed, so indexing "welcome
all there" is exactly the same as indexing "welcome there" as far as
the index is concerned.  I started to address this situation in the
1.4.x Lucene releases but it introduced a backward incompatible issue
so we reverted.  Care must be taken on the Query side of things -
PhraseQuery did not deal with anything but term position increments
of 1, but this has been addressed in the latest codebase (in
Subversion).
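
The difference between dropping stop words with and without a hole can be sketched without Lucene at all. Everything below is hypothetical illustration code (the class name, the "term@position" output format, whitespace tokenization); it is not the StopFilter implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Sketch only: shows how term positions differ when removed stop words
// do or do not leave a gap behind.
final class StopPositionSketch {

    static List<String> analyze(String text, Set<String> stopWords, boolean keepHoles) {
        List<String> out = new ArrayList<String>();
        int position = -1;   // position of the last emitted term
        int pendingGap = 0;  // stop words skipped since the last emitted term
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (stopWords.contains(token)) {
                if (keepHoles) {
                    pendingGap++; // remember the hole instead of closing it
                }
                continue;
            }
            position += 1 + pendingGap;
            pendingGap = 0;
            out.add(token + "@" + position);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<String>(Arrays.asList("all"));
        System.out.println(analyze("welcome all there", stops, false));
        System.out.println(analyze("welcome all there", stops, true));
        System.out.println(analyze("welcome there", stops, true));
    }
}
```

Without holes, "welcome all there" and "welcome there" produce identical positions, so no proximity scoring can tell them apart. With holes kept, "there" lands one position further away in the first document.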

I built a PositionalStopFilter for, and discussed these details in,
the Analysis chapter of "Lucene in Action" - it is available in the
code .zip at http://www.lucenebook.com


    Erik

