Thanks for your help. I used PhraseQuery to boost terms that are close together. I have an idea for stop words, but I don't know whether it is sound. I could index a dummy token in place of every stop word. This token would never be searched, but it would still count as a token and would leave a gap between the surrounding words. Does this solution have any drawbacks?
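To make the idea concrete, here is a toy sketch in plain Java. It deliberately does not use the Lucene API; it only models token positions as list indexes, so the effect of each stop word strategy on term distance can be compared. The class name StopwordPositions, the Mode enum and the "_stop_" placeholder term are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy model of token positions. Plain Java, NOT the Lucene API. */
final class StopwordPositions {

    /** How a stop word is handled at indexing time. */
    enum Mode { DROP, PLACEHOLDER, SKIP_POSITION }

    // Hypothetical dummy term; anything users never search for would do.
    static final String PLACEHOLDER = "_stop_";

    /**
     * Returns the terms by position: list index = position,
     * null = a position at which nothing is indexed (a "hole").
     */
    static List<String> analyze(String text, Set<String> stopWords, Mode mode) {
        List<String> positions = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty()) {
                continue; // leading whitespace yields an empty first element
            }
            if (!stopWords.contains(word)) {
                positions.add(word);
            } else if (mode == Mode.PLACEHOLDER) {
                positions.add(PLACEHOLDER); // dummy term takes the stop word's slot
            } else if (mode == Mode.SKIP_POSITION) {
                positions.add(null);        // keep the slot, index nothing
            }
            // Mode.DROP: the stop word vanishes and no slot is kept
        }
        return positions;
    }

    /** Distance in positions between the first occurrences, or -1 if a term is absent. */
    static int gap(List<String> positions, String first, String second) {
        int a = positions.indexOf(first);
        int b = positions.indexOf(second);
        return (a < 0 || b < 0) ? -1 : Math.abs(b - a);
    }

    /** Number of term occurrences that would actually be written to the index. */
    static int indexedTerms(List<String> positions) {
        int count = 0;
        for (String term : positions) {
            if (term != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Set<String> stopWords = Set.of("all");
        for (Mode mode : Mode.values()) {
            List<String> positions = analyze("welcome all there", stopWords, mode);
            System.out.println(mode + ": gap=" + gap(positions, "welcome", "there")
                    + ", indexed terms=" + indexedTerms(positions));
        }
    }
}
```

With "all" as the only stop word, "welcome all there" gets a gap of 1 under DROP, which makes it indistinguishable from "welcome there", and a gap of 2 under the other two modes. In this model the only difference between PLACEHOLDER and SKIP_POSITION is that the placeholder adds one extra indexed term occurrence per stop word. Whether that cost matters on a real index would need measuring.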
On 10/3/05, Joaquin Delgado <[EMAIL PROTECTED]> wrote:
> Chris, you may consider using a modified version of the Nutch analysis
> (http://lucene.apache.org/nutch/apidocs/org/apache/nutch/analysis/package-summary.html)
> which has a very slick treatment of stopwords. Please refer to chapter
> 4, page 145 of the Lucene in Action written by Eric and Otis for some
> details about the nutch implementation.
>
> -- J.D.
>
> Erik Hatcher wrote:
>
> >
> > On Oct 3, 2005, at 4:56 AM, Chris Lamprecht wrote:
> >
> >>> 1- Words in Document that are more close to original search terms have
> >>> a larger Score. For example, if I was searching for "wellcome",
> >>> Document("wellcome") must be better than Document("welcome")
> >>>
> >>
> >> I'm just "thinking outloud" here, but some ideas that come to mind
> >> are: Index both the original text (with spelling errors), and the
> >> spelling-corrected text. When you search, search on both the
> >> corrected text, and in a non-required query clause search on the
> >> uncorrected text, maybe boosted down a bit. This way, if the spelling
> >> was correct, it will match both the original term and the corrected
> >> term (since they're the same), but a document with a misspelling would
> >> match only the corrected term. You'll have to experiment with boosts
> >> and relevance/rankings here.
> >>
> >> Another idea is, if you know the number of misspellings made at
> >> indexing time (it seems like you do), then boost documents based on
> >> the number of spelling errors -- higher boost factor for fewer errors.
> >
> >
> > Another tip is that score is based on term frequency - so when
> > tokenizing correct spellings, add multiple of the correct words to
> > weight towards them.
> >
> >>> 2- Documents that have search terms close to each other, have a larger
> >>> Score. For example, if I was searching for "welcome there",
> >>> Document("welcome there") must be better than Document("welcome all
> >>> there"). Note that "all" is a stop word in my implementation.
> >>>
> >>
> >> PhraseQuery with a high slop factor (MAX_INT works) scores higher for
> >> terms that are closer together. You can construct the PhraseQuery
> >> yourself (programmatically), or QueryParser takes it as:
> >>
> >> "welcome there"~99999
> >>
> >> (with the quotes) 99999 is the slop factor, which means to accept
> >> documents where "welcome" is within 99999 positions from "there".
> >
> >
> > The issue is that "all" is a stop word, though. The StopFilter does
> > not leave a hole when stop words are removed, so indexing "welcome
> > all there" is exactly the same as indexing "welcome there" as far as
> > the index is concerned. I started to address this situation in the
> > 1.4.x Lucene releases but it introduced a backward incompatible issue
> > so we reverted. Care must be taken on the Query side of things -
> > PhraseQuery did not deal with anything but term position increments
> > of 1, but this has been addressed in the latest codebase (in
> > Subversion).
> >
> > I built a PositionalStopFilter for and discussed these details in the
> > Analysis chapter of "Lucene in Action" - it is available in the code
> > .zip at http://www.lucenebook.com
> >
> > Erik
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
>

--
regards,
Ahmed Saad