Hi Israel,
Let me try to put the problem more concisely. You have two kinds of fields:
1. Fields where term frequency is highly relevant, e.g. Body.
   Example: if the TF of "badger" in the Body of doc 1 is greater than
   the TF of "badger" in the Body of doc 2, then doc 1 scores higher.
2. Fields where term frequency is irrelevant, e.g. Page_Title.
   Example: the TF of "badger" in Page_Title doesn't affect the score.
If that is the case, then one solution is:
1. Build the query programmatically.
2. Form normal queries on fields of type 1 (e.g. Body).
3. Form constant-score variants of those queries on fields of type 2
   (e.g. Page_Title), for instance a ConstantScoreQuery wrapping a
   TermQuery in a filter (see the sketch below).
There is no need to change anything at index time.
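A rough sketch against Lucene 2.4 (the field names "body" and
"page_title" and the helper name buildQuery are just placeholders):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.ConstantScoreQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryWrapperFilter;
    import org.apache.lucene.search.TermQuery;

    static Query buildQuery(String term) {
        // TF-sensitive clause on the Body field: scores as usual.
        Query body = new TermQuery(new Term("body", term));

        // Constant-score clause on the Page_Title field: a match
        // contributes the same score no matter how often the term
        // repeats in the title.
        Query title = new ConstantScoreQuery(
                new QueryWrapperFilter(
                        new TermQuery(new Term("page_title", term))));

        BooleanQuery q = new BooleanQuery();
        q.add(body, BooleanClause.Occur.SHOULD);
        q.add(title, BooleanClause.Occur.SHOULD);
        return q;
    }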
I hope that helps.
Thanks
Umesh
On Sun, Jan 11, 2009 at 8:30 PM, Israel Tsadok <[email protected]> wrote:
> >
> > you can solve your problem at search time by passing a custom Similarity
> > class that looks something like this:
> >
> > private Similarity similarity = new DefaultSimilarity() {
> >     // Flatten tf to a constant: repeating a term no longer
> >     // raises the score.
> >     public float tf(float v) {
> >         return 1f;
> >     }
> >     public float tf(int i) {
> >         return 1f;
> >     }
> > };
> >
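> > (To take effect, it would be installed at search time, e.g. with
> > searcher.setSimilarity(similarity) before running the query.)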
> >
> Thanks, but it seems this solution would make all words score equally,
> regardless of their frequency. That is more extreme than what I had in
> mind. Chris Hostetter's suggestion of SweetSpotSimilarity makes the
> situation a little better, but it still doesn't distinguish between
> repeated words and words that appear in different locations in the text.
>
> For example, an encyclopedic article about badgers would probably have
> the word "badger" many times throughout its text. I would like such an
> article to score much higher than an unrelated article that simply uses
> the word "badger" three times in 800 words. Term frequency works well in
> this regard, but fails to make the encyclopedic article rank higher than
> documents that contain the word "badger" and not much else
> (http://tinyurl.com/8p5jsj).
>
> Paul Libberecht's comment has a point: if I eliminate duplicates in the
> tokenizer, both at indexing time and in the query parser, I should be
> able to make search work with a reduced effect for repeated terms
> (sketched after the list below). However, that approach has two
> downsides:
> 1. It will be impossible to find articles that (specifically) contain
> "badger badger badger".
> 2. Sometimes a two-word phrase is repeated ("barack obama barack obama
> barack obama"), which makes the tokenizer approach unsuitable.
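>
> Here is roughly what such a duplicate-dropping filter would look like
> (an untested sketch against the Lucene 2.4 TokenStream API; the class
> name is made up):
>
>     import java.io.IOException;
>     import java.util.HashSet;
>     import java.util.Set;
>     import org.apache.lucene.analysis.Token;
>     import org.apache.lucene.analysis.TokenFilter;
>     import org.apache.lucene.analysis.TokenStream;
>
>     // Emits only the first occurrence of each term; later
>     // occurrences are silently dropped.
>     public class UniqueTermFilter extends TokenFilter {
>         private final Set<String> seen = new HashSet<String>();
>
>         public UniqueTermFilter(TokenStream input) {
>             super(input);
>         }
>
>         public Token next() throws IOException {
>             Token t;
>             while ((t = input.next()) != null) {
>                 // Set.add returns false for terms already seen.
>                 if (seen.add(t.term())) {
>                     return t;
>                 }
>             }
>             return null;
>         }
>     }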
>
> Another option I'm considering is a negative boost to documents that
> contain repeated terms, but this is too general, since such a document
> may be very relevant to searches about different terms. I really only
> want to change the tf of the offending repeated term.
>
> Thanks for all your suggestions, and I'd appreciate any other ideas.
>
> Israel
>