Re: raw hit count

Ype Kingma Sun, 30 Nov 2003 05:20:52 -0800

Kent, Erik,

On Saturday 29 November 2003 17:20, Erik Hatcher wrote:
> I enjoy at least attempting to answer questions here, even if I'm half
> wrong, so by all means correct me if I misspeak....

Me too, :)

> On Saturday, November 29, 2003, at 06:37  PM, Kent Gibson wrote:
> > All I would like to know is how many times a query was
> > found in a particular document. I have no problems
> > getting the score from hits.score(). hits.length is
> > the number of times in total that the query was found,
> > however I want the number of times the query was
> > found on a document by document basis. is this
> > possible?

Could you be a bit more precise about what you mean
by 'the number of times the query was found'? For a single
query term it is straightforward, but what about, e.g., a query with three
optional terms?

>
> The 'coord' factor used in computing the score is exactly this.  See
> the javadoc for it:
>
>       http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#coord(int,%20int)

AFAIK, this overlap is the number of terms the document and the query
have in common.
For a query consisting of a single term, the overlap is always one,
and the number of times the query occurs in a document is the term frequency
in the document.
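
To make the distinction concrete, here is a minimal sketch in plain Java
(no Lucene classes involved). The class and method names are made up for
illustration, and the document is just a list of already-tokenized terms:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: contrasts term frequency with overlap.
class OverlapVsFrequency {

    // Term frequency: how many times a single term occurs in the document.
    static int termFrequency(List<String> docTokens, String term) {
        int count = 0;
        for (String token : docTokens) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Overlap: how many distinct query terms occur at least once in the document.
    static int overlap(List<String> docTokens, List<String> queryTerms) {
        Set<String> docTerms = new HashSet<>(docTokens);
        Set<String> distinctQueryTerms = new HashSet<>(queryTerms);
        int matched = 0;
        for (String term : distinctQueryTerms) {
            if (docTerms.contains(term)) {
                matched++;
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
                "lucene", "is", "a", "search", "library", "lucene", "indexes", "text");

        // Single-term query: overlap is 1, the "hit count" is the term frequency.
        System.out.println(termFrequency(doc, "lucene"));            // prints 2
        System.out.println(overlap(doc, Arrays.asList("lucene")));   // prints 1

        // Three optional terms: overlap counts matching terms, not occurrences.
        System.out.println(overlap(doc, Arrays.asList("lucene", "search", "ranking"))); // prints 2
    }
}
```

For the three-term query the overlap is 2 even though "lucene" occurs twice,
which is why the overlap alone does not give a per-document occurrence count.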

> You could implement a custom Similarity to capture the "overlap" or
> adjust the factor depending on what you're trying to accomplish.
>
> >  The only idea I have is to rerun the search,
> > but I can't even see how to run a search on only one
> > document!
>
> You could always rerun a search with a Filter with only one bit enabled
> and see if zero or one document is returned - that would be quite
> trivial and fast.
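
The single-bit idea can be illustrated with java.util.BitSet alone. This is
only a sketch of the principle, not Lucene code: the class name, the method
names and the set of matching document ids are all made up for the example.

```java
import java.util.BitSet;

// Hypothetical illustration of restricting a result set to one document.
class SingleDocFilterSketch {

    // Builds a bit set in which only the bit for docId is enabled.
    static BitSet onlyDoc(int docId, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        bits.set(docId);
        return bits;
    }

    // Applies the single-bit filter to the query's matches and reports
    // whether exactly one document (the requested one) survives.
    static boolean matchesDoc(BitSet queryMatches, int docId, int maxDoc) {
        BitSet filtered = (BitSet) queryMatches.clone();
        filtered.and(onlyDoc(docId, maxDoc));
        return filtered.cardinality() == 1;
    }

    public static void main(String[] args) {
        int maxDoc = 10;

        // Assume the query matched documents 2, 5 and 7.
        BitSet queryMatches = new BitSet(maxDoc);
        queryMatches.set(2);
        queryMatches.set(5);
        queryMatches.set(7);

        System.out.println(matchesDoc(queryMatches, 5, maxDoc)); // prints true
        System.out.println(matchesDoc(queryMatches, 4, maxDoc)); // prints false
    }
}
```

Note that this only answers whether the one document matches at all; it does
not say how often the query terms occur in it.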

You could also implement a Similarity that ignores the total number
of terms in the searched document field; see lengthNorm() in
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html
Because lengthNorm() is applied at indexing time, you will have to reindex
for this to work.
At query time you can then use a tf() implementation that is linear, instead
of the default square root in DefaultSimilarity, and a constant idf(),
instead of the default log of the inverse document frequency.
You should then get a document score that is proportional
to the number of times the query terms occur in the document.
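
As a rough illustration of why this works, here is a deliberately simplified
scoring model in plain Java. It is not Lucene's actual formula: it only sums
tf * idf * lengthNorm over the query terms and leaves out the other factors
Lucene applies, such as coord(). All names below are made up for the sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.IntToDoubleFunction;

// Hypothetical, simplified model of per-document scoring; not Lucene code.
class LinearScoreSketch {

    // Raw number of occurrences of a term in the document's token list.
    static int count(List<String> docTokens, String term) {
        int n = 0;
        for (String token : docTokens) {
            if (token.equals(term)) {
                n++;
            }
        }
        return n;
    }

    // Simplified score: sum over query terms of tf(freq) * idf * lengthNorm.
    static double score(List<String> docTokens, List<String> queryTerms,
                        IntToDoubleFunction tf, double idf, double lengthNorm) {
        double sum = 0.0;
        for (String term : queryTerms) {
            sum += tf.applyAsDouble(count(docTokens, term)) * idf * lengthNorm;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("a", "a", "a", "a", "b", "c");
        List<String> query = Arrays.asList("a", "b");

        // Square-root tf, as described for DefaultSimilarity: sqrt(4) + sqrt(1).
        System.out.println(score(doc, query, freq -> Math.sqrt(freq), 1.0, 1.0)); // prints 3.0

        // Linear tf, constant idf and lengthNorm: the score equals the raw count 4 + 1.
        System.out.println(score(doc, query, freq -> freq, 1.0, 1.0)); // prints 5.0
    }
}
```

With any other constant idf or lengthNorm the linear score is simply scaled
by that constant, so it stays proportional to the occurrence count.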

Kind regards,
Ype

