[jira] Commented: (LUCENE-2506) A Stateful Filter That Works Across Index Segments

Trejkaz (JIRA) Thu, 25 Nov 2010 14:20:39 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935891#action_12935891 ]


Trejkaz commented on LUCENE-2506:
---------------------------------

bq. What if Filter.getDocIdSet also received the top reader and the docBase of 
this sub reader within that top reader?

That would be enough for us. It would still allow for the parallel case, and 
would even be efficient in the parallel case for the majority of our filters.  
The bulk of our context-sensitive filters are actually sensitive only to the 
docBase: we run an SQL query, get back doc IDs relative to the root reader, 
and only have to offset them to the local one.
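
To illustrate the docBase-only case, here is a minimal sketch in plain Java. 
It uses java.util.BitSet instead of any Lucene class, and the names 
DocBaseOffset, toLocal and maxDoc are hypothetical, chosen only for this 
example. It assumes the sub reader covers top-level IDs from docBase 
(inclusive) to docBase + maxDoc (exclusive).

```java
import java.util.BitSet;

final class DocBaseOffset {
    /**
     * Converts doc IDs relative to the top-level reader into a bit set
     * relative to one sub reader. IDs outside the sub reader's range
     * [docBase, docBase + maxDoc) are ignored.
     */
    static BitSet toLocal(int[] topLevelDocIds, int docBase, int maxDoc) {
        BitSet local = new BitSet(maxDoc);
        for (int id : topLevelDocIds) {
            int localId = id - docBase;
            if (localId >= 0 && localId < maxDoc) {
                local.set(localId);
            }
        }
        return local;
    }

    public static void main(String[] args) {
        // Stand-in for the IDs an SQL query returned (hypothetical data).
        int[] fromSql = {3, 12, 15, 27};
        // Sub reader covering top-level IDs 10..19.
        System.out.println(toLocal(fromSql, 10, 10)); // prints {2, 5}
    }
}
```

Because each call depends only on its own docBase, sub readers can be 
processed in any order or in parallel.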

There are still filters where we would have to stop the world and go back to 
build up a filter over the whole reader (e.g. filtering out non-current copies 
of a document), but we only have one or two filters like that. It can be done 
easily using a Future, and it would impact only the speed of our own code.  (Of 
course, if Lucene ever allowed modifying existing documents in place, it would 
remove the need for a lot of that sort of hack, since we could have a 
'current-version' field and remove it from the non-current copies...)
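
A rough sketch of the Future approach, again in plain Java rather than 
against the Lucene API. The class WholeReaderFilter and its method 
bitsForSegment are hypothetical names for this example. The assumption is 
that the expensive whole-reader computation is wrapped in a FutureTask, so 
whichever segment asks first runs it and every other caller blocks until the 
result is ready.

```java
import java.util.BitSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.atomic.AtomicInteger;

final class WholeReaderFilter {
    private final FutureTask<BitSet> topLevelBits;

    WholeReaderFilter(Callable<BitSet> buildOverWholeReader) {
        this.topLevelBits = new FutureTask<>(buildOverWholeReader);
    }

    /** Returns the slice of the whole-reader bits that belongs to one segment. */
    BitSet bitsForSegment(int docBase, int maxDoc) {
        // run() executes the task at most once; later or concurrent calls
        // return immediately and then wait in get().
        topLevelBits.run();
        try {
            BitSet all = topLevelBits.get();
            // get(from, to) copies the range and rebases it to start at 0.
            return all.get(docBase, docBase + maxDoc);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) {
        AtomicInteger builds = new AtomicInteger();
        WholeReaderFilter filter = new WholeReaderFilter(() -> {
            builds.incrementAndGet();
            BitSet bits = new BitSet();
            // Hypothetical top-level IDs of the "current" document copies.
            bits.set(1); bits.set(4); bits.set(11); bits.set(13);
            return bits;
        });
        System.out.println(filter.bitsForSegment(10, 5)); // prints {1, 3}
        System.out.println(filter.bitsForSegment(0, 10)); // prints {1, 4}
        System.out.println(builds.get());                 // prints 1
    }
}
```

Only the filters that need the whole reader pay the blocking cost; the 
docBase-only filters are unaffected.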


> A Stateful Filter That Works Across Index Segments
> --------------------------------------------------
>
>                 Key: LUCENE-2506
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2506
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 3.0.2
>            Reporter: Karthick Sankarachary
>         Attachments: LUCENE-2506.patch
>
>
> By design, Lucene's Filter abstraction is applied once for every segment in 
> the index during searching. In particular, the reader provided to its 
> #getDocIdSet method does not represent the whole underlying index. In other 
> words, if the index has more than one segment the given reader only 
> represents a single segment.  As a result, that definition of the filter 
> suffers the limitation of not having the ability to permit/prohibit documents 
> in the search results based on the terms that reside in segments that precede 
> the current one.
> To address this limitation, we introduce here a StatefulFilter which 
> specifically builds on the Filter class so as to make it capable of 
> remembering terms in segments spanning the whole underlying index. To 
> reiterate, the need for making filters stateful stems from the fact that 
> some, although not most, filters care about the terms that they may have come 
> across in prior segments. The StatefulFilter keeps track of the terms from 
> prior segments in a cache that is maintained in a StatefulTermsEnum instance 
> on a per-thread basis. 
> Additionally, to address the case where a filter might want to accept the 
> last matching term, we keep track of the TermsEnum#docFreq of the terms in 
> the segments filtered thus far. By comparing the sum of such 
> TermsEnum#docFreq with that of the top-level reader, we can tell if the 
> current segment is the last segment in which the current term appears. 
> For this to work correctly, we require the user to explicitly set the 
> top-level reader on the StatefulFilter. Knowing what the top-level reader is 
> also helps the StatefulFilter to clean up after itself once the search has 
> concluded.
> Note that we leave it up to each concrete sub-class of the stateful filter to 
> decide what to remember in its state and what not to. In other words, it can 
> choose to remember as much or as little from prior segments as it deems 
> necessary. In keeping with the TermsEnum interface, which the 
> StatefulTermsEnum class extends, the filter must decide which terms to accept 
> or not, based on the holistic state of the search.  
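
The docFreq comparison in the description above can be sketched in plain 
Java as follows. This is not code from the attached patch: 
LastSegmentTracker and isLastSegmentFor are hypothetical names, terms are 
plain strings, and the per-thread caching the description mentions is left 
out. It assumes the top-level docFreq of a term equals the sum of its 
per-segment docFreq values.

```java
import java.util.HashMap;
import java.util.Map;

final class LastSegmentTracker {
    // Running sum of docFreq per term over the segments seen so far.
    private final Map<String, Integer> seenDocFreq = new HashMap<>();

    /**
     * Records the term's docFreq in the current segment and reports whether
     * the running sum has reached the top-level docFreq, i.e. whether no
     * later segment can contain the term.
     */
    boolean isLastSegmentFor(String term, int segmentDocFreq, int topLevelDocFreq) {
        int total = seenDocFreq.merge(term, segmentDocFreq, Integer::sum);
        return total >= topLevelDocFreq;
    }

    public static void main(String[] args) {
        LastSegmentTracker tracker = new LastSegmentTracker();
        // Hypothetical term with docFreq 5 in the top-level reader.
        System.out.println(tracker.isLastSegmentFor("foo", 2, 5)); // prints false
        System.out.println(tracker.isLastSegmentFor("foo", 3, 5)); // prints true
    }
}
```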

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


