Re: Any way to ignore repeated terms in TF calculation?

Karl Wettin Fri, 26 Dec 2008 16:38:41 -0800

Hi Israel,

You can solve your problem at search time by passing a custom Similarity class that looks something like this:

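  import org.apache.lucene.search.DefaultSimilarity;
  import org.apache.lucene.search.Similarity;

  // Flatten term frequency: a matching term counts once, however often it repeats.
  private Similarity similarity = new DefaultSimilarity() {
    @Override
    public float tf(float freq) {
      return 1f;
    }
    @Override
    public float tf(int freq) {
      return 1f;
    }
  };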



See javadocs for details.
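
To make the effect concrete, here is a small stand-alone sketch that does not depend on Lucene. It compares the square-root damping that DefaultSimilarity.tf(float) applies with the flat variant above. The class and method names (FlatTfDemo, defaultTf, flatTf) are made up for illustration, and the numbers are only the tf factor, not a full Lucene score.

```java
public class FlatTfDemo {

    /** Square-root damping, the formula DefaultSimilarity.tf(float) uses. */
    public static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    /** Flat variant: any number of occurrences contributes the same factor. */
    public static float flatTf(float freq) {
        return 1f;
    }

    public static void main(String[] args) {
        // 1 = a document mentioning "badger" once, 8 = the repeated-badger document.
        int[] freqs = {1, 8};
        for (int freq : freqs) {
            System.out.println("freq=" + freq
                    + " default tf=" + defaultTf(freq)
                    + " flat tf=" + flatTf(freq));
        }
    }
}
```

With the default formula the eight-badger document gets a tf factor of roughly 2.83 against 1.0 for a single mention; with the flat version both get 1.0, so repetition alone no longer lifts the document. In Lucene itself the custom instance would typically be handed to the searcher through setSimilarity before running the query; treat that wiring as an assumption and confirm it against the javadocs for your version.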

  karl

On 25 Dec 2008, at 14:20, Israel Tsadok wrote:

A recurring problem I have with Lucene results is when a document contains
the same word over and over again. If for some reason I have a document
containing "badger badger badger badger badger badger badger badger", it
would appear high on the search results for "badger", even though it's
usually a useless document.
What I would like to do is ignore repeating words when counting the term
frequency. At first, I thought I could achieve this by indexing with a
TokenFilter that would skip repeated tokens, but then a search for e.g.
"Rochelle Rochelle" would return no results.
What I would like is to index all 8 "badger"s, but have the frequency of
"badger" saved as 1. Is that even possible?

Digging around in Lucene code, I found term frequency calculations
in FreqProxTermsWriterPerField.addTerm() - is that where I need to look?
Any help would be appreciated.
Israel


