Re: HitCollector#collect(int,float,Collection)

Michael McCandless Tue, 07 Apr 2009 01:23:45 -0700

Do you mean tracking the "atomic queries" that caused a given hit to
match (where "atomic query" is a query that actually uses
TermDocs/Positions to check matching, vs other queries like
BooleanQuery that "glom together" sub-query matches)?


EG for a boolean query w/ N clauses, which of those N clauses matched?

This has been discussed/requested several times on java-user, and I
think it makes a lot of sense.

A natural place to do this is the Scorer API, i.e. extend it with a
"getMatchingAtomicQueries" or some such.  Probably, for efficiency,
each Query should be pre-assigned an int position, and then the
matching is represented as a bit array, reused across matches.  Your
collector could then ask the scorer for these bits if it wanted.
There should be no performance cost for collectors that don't use this
functionality.
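
A minimal sketch of that idea, in plain Java with no Lucene dependency.
Every name here (AtomicClause, OrScorer, SketchCollector, and the exact
shape of getMatchingAtomicQueries) is hypothetical and only illustrates
the bit-array approach; none of it is existing Lucene API, and a real
Scorer would iterate postings rather than probe every doc:

```java
import java.util.BitSet;

// Hypothetical stand-in for an "atomic" query: it has a pre-assigned
// bit position and matches a fixed set of doc ids (instead of TermDocs).
final class AtomicClause {
    final int position;
    private final BitSet docs = new BitSet();

    AtomicClause(int position, int... matchingDocs) {
        this.position = position;
        for (int d : matchingDocs) {
            docs.set(d);
        }
    }

    boolean matches(int doc) {
        return docs.get(doc);
    }
}

// Hypothetical collector callback; it receives the scorer so it can ask
// for the match bits, or ignore them entirely.
interface SketchCollector {
    void collect(int doc, OrScorer scorer);
}

// Hypothetical disjunction scorer over atomic clauses.
final class OrScorer {
    private final AtomicClause[] clauses;
    private final BitSet matchBits = new BitSet(); // reused across hits
    private int currentDoc = -1;

    OrScorer(AtomicClause... clauses) {
        this.clauses = clauses;
    }

    // Computed only on request, so collectors that never call this pay
    // nothing extra. The returned BitSet is shared and is overwritten
    // on the next call; copy it if it must outlive the current hit.
    BitSet getMatchingAtomicQueries() {
        matchBits.clear();
        for (AtomicClause c : clauses) {
            if (c.matches(currentDoc)) {
                matchBits.set(c.position);
            }
        }
        return matchBits;
    }

    // Naive driver: visits every doc id below maxDoc and collects the
    // ones matched by at least one clause.
    void score(SketchCollector collector, int maxDoc) {
        for (int doc = 0; doc < maxDoc; doc++) {
            boolean hit = false;
            for (AtomicClause c : clauses) {
                if (c.matches(doc)) {
                    hit = true;
                    break;
                }
            }
            if (hit) {
                currentDoc = doc;
                collector.collect(doc, this);
            }
        }
    }
}
```

A collector that wants the information calls
scorer.getMatchingAtomicQueries() inside collect() and reads the set
bits as clause positions; one that doesn't simply never calls it.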

We've also discussed (under LUCENE-1522) similar extensions to Scorer
API to get exact positions contributing to a match, and possibly using
such an API to merge Span{Term,And,Or}Query into their "normal"
counterparts.

But we should do this separately from LUCENE-1575 -- the java ghosts
there are already challenging enough!

Mike

On Mon, Apr 6, 2009 at 11:57 PM, Shai Erera <[email protected]> wrote:
> Hi Karl,
>
> LUCENE-1575 refactors HitCollector by separating the score from document
> collection. So if we were to introduce this type of method (that you
> suggest), it would be through a setQueries(Collection<Query>) method.
>
> Maybe you could try to verify if your use case makes sense, is efficient etc.,
> before we do this change. Adding a setQueries to Collector (the new name of
> HC) shouldn't be a problem since we can always add an empty-impl method, not
> affecting back-compat. However, I wonder where it would be called from, and
> whether it makes sense to create that Collection object and pass it around
> while knowing that most collectors will not use it?
>
> Is it something that you perhaps can implement by extending Collector (and
> some other classes), and in your extending code call setQueries? Today,
> as far as I remember, only Scorer calls collect() and I'm not sure if Scorer
> has the information of the matching queries. Even if it does, extending it
> and calling setQueries from the extension seems more reasonable, than adding
> such call to every query execution, which also means instantiating a new
> Collection<Query> for every search (unless we provide an API on
> IndexSearcher which allows you to pass such object).
>
> What do you think?
>
> On Tue, Apr 7, 2009 at 3:21 AM, Karl Wettin <[email protected]> wrote:
>>
>> How crazy would it be to refactor HitCollector so it also accepts the
>> matching queries?
>>
>> Let's ignore my use case (not sure it makes sense yet, it's related to
>> finding a threshold between probably interesting and definitely not
>> interesting results of huge OR-statements, but I really have to try it out
>> before I can say if it's any good) and just focus on the speed impact. If I
>> cleared and reused the Collection passed down to the HitCollector then it
>> shouldn't really slow things down, right? And if I reused the collections in
>> my TopDocsCollector as low scoring results were pushed down then it shouldn't
>> have to be expensive there either. Or?
>>
>>
>>    karl
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>
>
