[ https://issues.apache.org/jira/browse/LUCENE-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karthick Sankarachary updated LUCENE-2506:
------------------------------------------

    Description: 
By design, Lucene applies a Filter once per segment of the index during 
searching. In particular, the reader passed to its #getDocIdSet method does not 
represent the whole underlying index: if the index has more than one segment, 
the given reader represents only a single segment. 

As a result, a Filter cannot permit or prohibit documents in the search results 
based on terms that reside not just in the current segment but also in the 
segments visited before it during the search. 

To address this limitation, we introduce a StatefulFilter, which builds on the 
Filter class so that it can remember terms from segments spanning the whole 
underlying index. The need for stateful filters stems from the fact that some, 
although not most, filters care about which terms they came across in prior 
segments. The StatefulFilter keeps track of the terms from prior segments in a 
cache that is maintained in a StatefulTermsEnum instance on a per-thread basis. 
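
The per-thread cache described above can be pictured with a minimal sketch. It 
is not the code from the attached patch and does not use Lucene classes: the 
class name SeenTermsSketch and its methods are hypothetical, and terms are 
modelled as plain strings.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the StatefulTermsEnum from the patch.
final class SeenTermsSketch {
    // Each searching thread gets its own set of remembered terms.
    private final ThreadLocal<Set<String>> seen = ThreadLocal.withInitial(HashSet::new);

    /** Records one segment's terms and returns those not seen in earlier segments. */
    List<String> visitSegment(List<String> segmentTerms) {
        Set<String> cache = seen.get();
        List<String> firstTime = new ArrayList<>();
        for (String term : segmentTerms) {
            // Set#add returns false when the term was already remembered.
            if (cache.add(term)) {
                firstTime.add(term);
            }
        }
        return firstTime;
    }

    /** Drops the calling thread's state, for example once a search completes. */
    void reset() {
        seen.remove();
    }
}
```

Calling visitSegment once per segment, in search order, lets later segments 
observe what earlier ones contained. A real filter would decide for itself 
what to keep in the cache.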

Additionally, to address the case where a filter wants to accept the last 
matching term, we keep track of the TermsEnum#docFreq of the terms in the 
segments filtered so far. By comparing the sum of those TermsEnum#docFreq 
values with the docFreq reported by the top-level reader, we can tell whether 
the current segment is the last one in which the current term appears. For 
this to work correctly, the user must explicitly set the top-level reader on 
the StatefulFilter. Knowing the top-level reader also lets the StatefulFilter 
clean up after itself once the search completes.
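
The docFreq comparison can be sketched as follows. This is an illustration 
under stated assumptions rather than the patch itself: LastSegmentSketch and 
isLastSegmentFor are hypothetical names, and the top-level document 
frequencies are supplied as a plain map instead of being read from a reader.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the StatefulFilter from the patch.
final class LastSegmentSketch {
    // docFreq per term as reported by the top-level reader (assumed given).
    private final Map<String, Integer> topLevelDocFreq;
    // Running sum of per-segment docFreq values seen so far.
    private final Map<String, Integer> seenDocFreq = new HashMap<>();

    LastSegmentSketch(Map<String, Integer> topLevelDocFreq) {
        this.topLevelDocFreq = new HashMap<>(topLevelDocFreq);
    }

    /**
     * Adds the current segment's docFreq for the term to the running sum and
     * reports whether the sum has reached the top-level docFreq, meaning no
     * later segment can contain the term.
     */
    boolean isLastSegmentFor(String term, int segmentDocFreq) {
        int total = seenDocFreq.merge(term, segmentDocFreq, Integer::sum);
        return total >= topLevelDocFreq.getOrDefault(term, 0);
    }
}
```

For a term with a top-level docFreq of 3, a segment contributing 1 is not the 
last one, and a following segment contributing 2 is.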

Note that we leave it up to the concrete subclass of the stateful filter to 
decide what to remember in its state. It can choose to remember as much or as 
little from prior segments as it needs. In keeping with the TermsEnum 
interface, on which the StatefulTermsEnum class builds, the subclass must tell 
the searcher which terms to accept and which to skip. More often than not, the 
filter's state will come in handy when implementing that acceptance logic. 

  was:
By design, Lucene's Filter abstraction is applied once per segment in the index 
during searching. In particular, the reader provided to its #getDocIdSet method 
does not represent the whole underlying index. In other words, if the index has 
more than one segment the given reader only represents a single segment. 

As a result, that definition of the Filter suffers from a limitation in that it 
does not have the ability to permit/prohibit documents in the search results 
based on the terms residing in not just the current segment but also the ones 
that came before it during the search. 

To address this limitation, we introduce here a StatefulFilter which 
specifically builds on the Filter class so as to make it capable of remembering 
terms in segments spanning the whole underlying index. To reiterate, the need 
for making filters stateful stems from 
the fact that some, although not most, filters care about what terms they may 
have come across in prior segments. It does so by keeping track of the past 
terms from prior segments in a cache that is maintained in a StatefulTermsEnum 
instance on a per-thread basis.

Note that we leave it up to the concrete sub-class of the stateful filter to 
decide what to remember in its state or what not to. In other words, it can 
choose to remember as much or as little from prior segments as it desires. In 
keeping with the TermsEnum interface, which the StatefulTermsEnum class builds 
on, it must let the searcher know what terms to accept and which ones to skip 
over. More often than not, the state of the filter will come in handy while 
implementing that very acceptance logic. 


> A Stateful Filter That Works Across Index Segments
> --------------------------------------------------
>
>                 Key: LUCENE-2506
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2506
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 3.0.2
>            Reporter: Karthick Sankarachary
>         Attachments: LUCENE-2506.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

