I am trying to fetch results similar to a Document in the index. The problem
is a myriad of irrelevant hits whose scores are less than 1 percent. I
was thinking of writing this class in order to omit these results. I can't use
TopDocs because the number of *really* similar results can't be known a priori.
There might be hundreds or only 10 relevant hits...
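
A minimal sketch of the kind of filter described above, assuming a relative
cutoff (a fraction of the best score for this one query) is what is wanted.
RelativeScoreFilter and ScoredDoc are hypothetical names, not Lucene classes;
collect(int, float) only mirrors the shape of a HitCollector callback. Because
the max score is not known until every hit has been seen, the sketch buffers
all hits and applies the cutoff afterwards.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Lucene: drops hits far below the best score. */
class RelativeScoreFilter {

    /** A document id paired with its raw (non-normalized) score. */
    static final class ScoredDoc {
        final int doc;
        final float score;

        ScoredDoc(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private final List<ScoredDoc> buffered = new ArrayList<>();
    private float maxScore = 0f;

    /** Buffer every hit and track the running maximum score. */
    void collect(int doc, float score) {
        buffered.add(new ScoredDoc(doc, score));
        if (score > maxScore) {
            maxScore = score;
        }
    }

    /** Return hits with score >= fraction * maxScore, best first. */
    List<ScoredDoc> results(float fraction) {
        float cutoff = fraction * maxScore;
        List<ScoredDoc> kept = new ArrayList<>();
        for (ScoredDoc d : buffered) {
            if (d.score >= cutoff) {
                kept.add(d);
            }
        }
        // Sort by descending score, as a Hits object would present them.
        kept.sort((a, b) -> Float.compare(b.score, a.score));
        return kept;
    }
}
```

Calling results(0.01f) after the search would keep only hits scoring at least
1 percent of the top hit. The cutoff is relative to a single query, so it does
not make scores comparable across queries, and buffering every hit costs
memory in proportion to the number of matches.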

--jaf

On 4/22/07, Erick Erickson <[EMAIL PROTECTED]> wrote:

As to point <2>, the only way I was able to deal with this was by
using a TopDocs, which does have a max score. But in that case,
I don't believe you can limit the number of hits examined.

I've just got to ask... Why do you (jafarim) want to fiddle with the
threshold? How is this going to benefit the user over and above
just getting the first N < 100 docs from a Hits object? They're
sorted already in relevancy order. Yonik's point that scores aren't
comparable across queries is well taken and should give you pause.

A clear statement of what you are trying to accomplish from the
user's perspective will allow folks to give you much more
useful responses.....

Erick

On 4/22/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>
> On 4/22/07, jafarim <[EMAIL PROTECTED]> wrote:
> > > Be aware that
> > > score thresholds don't work well in general since scores aren't
> > > really comparable from one query to another.
> >
> >
> > What if I normalize the scores in such a manner that they become
> > between 0 and 1?
>
> Two issues with that:
> 1) You never *gain* information by normalizing in this manner.  If
> non-normalized scores aren't directly comparable, then neither will
> normalized scores be.
> 2) To normalize by dividing by the max score, you need to know the max
> score.  As hits are being collected in the HitCollector, the max score
> is not yet known.
>
> -Yonik
