Re: Re[2]: lucene scoring

Doron Cohen Fri, 08 Aug 2008 12:45:04 -0700

The following suggestion is weaker than the requested functionality, but
you may find the concept useful for ignoring so-called "garbage" results.

Assume that the query is a simple OR query made of a few words.
By examining the frequencies of these words in the index
(their DFs), devise a synthetic document that is the worst
document you would be willing to accept as a useful result.
Alternatively, ignore the DFs and create a few such documents,
each containing perhaps one or a few of the query words (and likely
many other words). Now virtually add the synthetic document(s)
to the index. This can be done by creating a small in-memory index
and opening a MultiReader on top of the real index and
the dummy one. Now execute the query with a filter that
accepts only the synthetic documents. The score of the worst
acceptable document(s) can then be used as a threshold when
running the query on the original index.
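
To make the idea concrete, here is a rough, untested-against-Lucene sketch
in Python. It does not use Lucene at all: the scorer is a toy TF-IDF
function (sqrt(tf) * idf^2 * 1/sqrt(length)), and all names in it
(build_stats, score, search_with_threshold) are made up for illustration.
It also assumes the collection statistics are computed over the real and
synthetic documents together, which is what a combined reader would see.

```python
import math
from collections import Counter


def build_stats(docs):
    """Return (number of docs, document frequency of each term)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    return len(docs), df


def score(query, doc, n_docs, df):
    """Toy TF-IDF score of one tokenized doc for an OR query (not Lucene's)."""
    if not doc:
        return 0.0
    tf = Counter(doc)
    length_norm = 1.0 / math.sqrt(len(doc))
    total = 0.0
    for term in set(query):
        if tf[term]:
            idf = 1.0 + math.log(n_docs / (df[term] + 1))
            total += math.sqrt(tf[term]) * idf * idf * length_norm
    return total


def search_with_threshold(query, real_docs, synthetic_docs):
    """Keep real docs that score at least as high as the worst synthetic doc."""
    # Statistics come from the "virtual" index: real + synthetic documents.
    n_docs, df = build_stats(real_docs + synthetic_docs)
    # The lowest-scoring synthetic document defines the cut-off.
    threshold = min(score(query, d, n_docs, df) for d in synthetic_docs)
    hits = [(i, score(query, d, n_docs, df)) for i, d in enumerate(real_docs)]
    kept = [(i, s) for i, s in hits if s > 0 and s >= threshold]
    return sorted(kept, key=lambda hit: -hit[1]), threshold


query = ["lucene", "scoring"]
real_docs = [
    ["lucene", "scoring", "formula"],         # matches both query words
    ["lucene", "index", "writer", "tips"],    # one query word, short doc
    ["lucene"] + ["padding"] * 29,            # one query word, long doc
    ["cooking", "recipes"],                   # no match
]
# Hypothetical "worst acceptable" document: one query word in ten tokens.
synthetic_docs = [["lucene"] + ["filler"] * 9]

kept, threshold = search_with_threshold(query, real_docs, synthetic_docs)
print("threshold:", threshold)
for doc_id, s in kept:
    print(doc_id, s)
```

With this sample data the long, mostly-padding document falls below the
synthetic document's score and is dropped, while the two shorter matches
are kept. The threshold is recomputed per query, so the sketch shares the
per-query cost of the original suggestion.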

This is inefficient, since it has to be done for each query, and it would
be hard to implement for general queries. I have also never tried it...

Doron

2008/8/8 Александр Аристов <[EMAIL PROTECTED]>

> Query independent means that the threshold should have the same meaning
> for all queries, and found docs below it should be discarded. The current
> scoring implementation doesn't guarantee that, say, two documents found by
> two different queries that both got a score of 0.5 are of the same quality.
>
> I don't want to prevent docs from being indexed, no. But I want to be sure
> that two docs with the same score in two different queries have the same
> quality (they contain the same set of found terms, length, etc.)
>
> Alexander
>
> -----Original Message-----
> From: Andrzej Bialecki <[EMAIL PROTECTED]>
> To: [email protected]
> Date: Thu, 07 Aug 2008 22:44:46 +0200
> Subject: Re: lucene scoring
>
>
> Александр Аристов wrote:
> > I want to implement searching with the ability to set a so-called
> > confidence level below which I would treat documents as garbage. I cannot
> > define the level per query, as the level should be relevant for all
> > documents.
>
> Hmm .. I'm not sure if I understand it properly - if the level is
> query-independent, then it's a constant factor, which you can put in a
> field during the index creation - and then you could use a Filter or
> FunctionQuery to exclude documents with this factor below the threshold.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
