Re: Filters and multiple, per-segment calls to getDocIdSet

Michael McCandless Fri, 26 Mar 2010 03:00:46 -0700

On Fri, Mar 26, 2010 at 1:17 AM, Daniel Noll <dan...@nuix.com> wrote:
> On Thu, Mar 25, 2010 at 21:41, Michael McCandless
> <luc...@mikemccandless.com> wrote:
>>
>> This depends on the particulars of filter... but in general you
>> shouldn't have to consume more RAM, I think?  Ie you should be able to
>> do your computation against the top-level reader, and then store the
>> results of your computation per-sub-reader.
>
> I am having issues figuring out how to get a reference to the
> top-level reader.  The API passes them in one by one and I can't see a
> way to find the top-level reader for one which was passed in.  I can't
> easily cheat and pass the top-level one into the Filter constructor,
> because filters are serialisable and that kind of thing won't survive
> serialisation.


Probably, for now at least, you'll have to park a reference to the
current IndexReader in some known, static place, where your filter
checks?

> To throw an additional spanner in the works, the behaviour I need is
> that only the *last* document should be returned.  So even if a
> certain document matches the filter after N readers have been passed
> in, it might not match the filter after N+1 readers have been passed
> in.  Essentially I need a method like...
>
>    DocIdSet[] getDocIdSets(IndexReader[] readers);
>
> And where the readers are guaranteed to be in order of docBase.

This is a doozie of a filter :)

> By the way, I notice that the order the readers are passed to the
> method is essentially undocumented. The test code appears to be
> assuming they will be passed in the natural order of the documents
> (which is logical) but couldn't a future change parallelise segment
> searches for performance reasons, thus reordering the calls?

Yeah the order is intentionally not defined, to allow for possible
future optimizations like this.  It is in-docBase-order today, but
won't necessarily always be that way... in fact during 2.9 development
there was a brief time when it was in a different order.

> It would
> be nice if the API would explicitly pass the docBase for the
> IndexReader - this would reduce the need to perform maths to determine
> the docBase ourselves, and also make it possible to parallelise those
> calls later.

Maybe we should do that... or maybe make a new class, containing the
toplevel reader, sub reader, and docBase?  Something like that?  I
think there may be an issue already open for this...

I agree this API is problematic for context-sensitive filters (filters
that need to see all segments in the index in order to decide how to
compute the current segment's DISI).  Most of Lucene's filters are
context-free and so this API was created with that use-case in mind.
Yours is not the only context-sensitive filter out there...

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Filters and multiple, per-segment calls to getDocIdSet

Reply via email to