Hi Israel,
Let me try to put the problem more concisely. You have two kinds of fields:
1. Fields where term frequency is highly relevant, e.g. Body.
   Example: if the TF of "badger" in the Body of doc 1 is greater than
   the TF of "badger" in the Body of doc 2, then doc 1 scores higher.
2. Fields where term frequency is irrelevant, e.g. Page_Title.
   Example: the TF of "badger" in Page_Title doesn't affect the score.
If that is the case, then one solution is:
1. Build the query programmatically.
2. Form normal queries on fields of type 1 (e.g. Body).
3. Form constant-score variants of those queries on fields of type 2
   (e.g. Page_Title), for instance a ConstantScoreQuery wrapping a
   TermQuery in a filter (see the sketch below).
There is no need to change anything at index time.
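A rough sketch against Lucene 2.4 (the field names "body" and
"page_title" and the helper name buildQuery are just placeholders):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.ConstantScoreQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryWrapperFilter;
    import org.apache.lucene.search.TermQuery;

    static Query buildQuery(String term) {
        // TF-sensitive clause on the Body field: scores as usual.
        Query body = new TermQuery(new Term("body", term));

        // Constant-score clause on the Page_Title field: a match
        // contributes the same score no matter how often the term
        // repeats in the title.
        Query title = new ConstantScoreQuery(
                new QueryWrapperFilter(
                        new TermQuery(new Term("page_title", term))));

        BooleanQuery q = new BooleanQuery();
        q.add(body, BooleanClause.Occur.SHOULD);
        q.add(title, BooleanClause.Occur.SHOULD);
        return q;
    }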
I hope that helps.
Thanks
Umesh
On Sun, Jan 11, 2009 at 8:30 PM, Israel Tsadok <[email protected]> wrote:
> >
> > you can solve your problem at search time by passing a custom Similarity
> > class that looks something like this:
> >
> > private Similarity similarity = new DefaultSimilarity() {
> >     // Flatten tf to a constant: repeating a term no longer
> >     // raises the score.
> >     public float tf(float v) {
> >         return 1f;
> >     }
> >     public float tf(int i) {
> >         return 1f;
> >     }
> > };
> >
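> > (To take effect, it would be installed at search time, e.g. with
> > searcher.setSimilarity(similarity) before running the query.)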
> >
> Thanks, but it seems this solution would make all words score equally,
> regardless of their frequency. That is more extreme than what I had in
> mind. Chris Hostetter's suggestion of SweetSpotSimilarity makes the
> situation a little better, but it still doesn't distinguish between
> repeated words and words that appear in different locations in the text.
>
> For example, an encyclopedic article about badgers would probably have
> the word "badger" many times throughout its text. I would like such an
> article to score much higher than an unrelated article that simply uses
> the word "badger" three times in 800 words. Term frequency works well in
> this regard, but fails to make the encyclopedic article rank higher than
> documents that contain the word "badger" and not much else
> (http://tinyurl.com/8p5jsj).
>
> Paul Libberecht's comment has a point: if I eliminate duplicates in the
> tokenizer, both at indexing time and in the query parser, I should be
> able to make search work with a reduced effect for repeated terms
> (sketched after the list below). However, that approach has two
> downsides:
> 1. It will be impossible to find articles that (specifically) contain
> "badger badger badger".
> 2. Sometimes a two-word phrase is repeated ("barack obama barack obama
> barack obama"), which makes the tokenizer approach unsuitable.
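>
> Here is roughly what such a duplicate-dropping filter would look like
> (an untested sketch against the Lucene 2.4 TokenStream API; the class
> name is made up):
>
>     import java.io.IOException;
>     import java.util.HashSet;
>     import java.util.Set;
>     import org.apache.lucene.analysis.Token;
>     import org.apache.lucene.analysis.TokenFilter;
>     import org.apache.lucene.analysis.TokenStream;
>
>     // Emits only the first occurrence of each term; later
>     // occurrences are silently dropped.
>     public class UniqueTermFilter extends TokenFilter {
>         private final Set<String> seen = new HashSet<String>();
>
>         public UniqueTermFilter(TokenStream input) {
>             super(input);
>         }
>
>         public Token next() throws IOException {
>             Token t;
>             while ((t = input.next()) != null) {
>                 // Set.add returns false for terms already seen.
>                 if (seen.add(t.term())) {
>                     return t;
>                 }
>             }
>             return null;
>         }
>     }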
>
> Another option I'm considering is a negative boost to documents that
> contain repeated terms, but this is too general, since such a document
> may be very relevant to searches about different terms. I really only
> want to change the tf of the offending repeated term.
>
> Thanks for all your suggestions, and I'd appreciate any other ideas.
>
> Israel
>