[jira] Commented: (LUCENE-2506) A Stateful Filter That Works Across Index Segments

Trejkaz (JIRA) Thu, 25 Nov 2010 19:22:41 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935928#action_12935928 ]


Trejkaz commented on LUCENE-2506:
---------------------------------

That sounds like it would cost a fair bit of memory if you had hundreds of 
millions of documents.  The worst thing is that people actually load this much 
in all the time, on desktop computers with only a few gigs of RAM.  Because 
it's a desktop app, "why don't you get another gig of RAM for the cache" 
probably won't fly if the extra requirement suddenly appeared in a new release 
of our software.  If it were a server app, maybe that would fly... maybe.

But yeah, some variant on this which reads only part of it from disk instead 
of all of it might speed things up a bit.  A giant IntBuffer over a 
memory-mapped file would probably be cached by the OS anyway.
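
To make the memory-mapped idea concrete, here is a minimal sketch using only 
java.nio.  The class name MappedIntCache, its methods and the one-int-per-doc 
file layout are made up for illustration; none of this comes from the patch.  
Note that a single mapping is limited to Integer.MAX_VALUE bytes (roughly 536 
million ints), so a larger index would need several mappings.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical per-document int cache backed by a memory-mapped file. */
final class MappedIntCache {
    private final IntBuffer values;

    private MappedIntCache(IntBuffer values) {
        this.values = values;
    }

    /** Writes one int per document, big-endian (ByteBuffer's default order). */
    static void write(Path file, int[] perDocValues) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.allocate(perDocValues.length * Integer.BYTES);
            // The int view has its own position, so buf still spans all bytes.
            buf.asIntBuffer().put(perDocValues);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
        }
    }

    /** Maps the whole file read-only; pages are loaded lazily by the OS. */
    static MappedIntCache open(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            // The mapping remains valid after the channel is closed.
            return new MappedIntCache(mapped.asIntBuffer());
        }
    }

    /** Returns the cached value for a (top-level) document id. */
    int get(int docId) {
        return values.get(docId);
    }

    /** Number of ints available in the mapping. */
    int size() {
        return values.limit();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("doc-cache", ".bin");
        write(file, new int[] {7, 42, -1});
        MappedIntCache cache = open(file);
        System.out.println(cache.get(1));
    }
}
```

Only the pages that actually get touched are read from disk, and the mapping 
lives outside the Java heap, so the heap cost stays small even when the file 
is large.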


> A Stateful Filter That Works Across Index Segments
> --------------------------------------------------
>
>                 Key: LUCENE-2506
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2506
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 3.0.2
>            Reporter: Karthick Sankarachary
>         Attachments: LUCENE-2506.patch
>
>
> By design, Lucene's Filter abstraction is applied once for every segment in 
> the index during searching. In particular, the reader provided to its 
> #getDocIdSet method does not represent the whole underlying index. In other 
> words, if the index has more than one segment the given reader only 
> represents a single segment.  As a result, a filter defined this way cannot 
> permit or prohibit documents in the search results based on terms that 
> reside in segments preceding the current one.
> To address this limitation, we introduce here a StatefulFilter which 
> specifically builds on the Filter class so as to make it capable of 
> remembering terms in segments spanning the whole underlying index. To 
> reiterate, the need for making filters stateful stems from the fact that 
> some, although not most, filters care about the terms that they may have come 
> across in prior segments. The StatefulFilter does so by keeping track of the 
> terms from prior segments in a cache that is maintained in a 
> StatefulTermsEnum instance on a per-thread basis. 
> Additionally, to address the case where a filter might want to accept the 
> last matching term, we keep track of the TermsEnum#docFreq of the terms in 
> the segments filtered thus far. By comparing the sum of such 
> TermsEnum#docFreq with that of the top-level reader, we can tell if the 
> current segment is the last segment in which the current term appears. 
> Ideally, for this to work correctly, we require the user to explicitly set 
> the top-level reader on the StatefulFilter. Knowing what the top-level reader 
> is also helps the StatefulFilter to clean up after itself once the search has 
> concluded.
> Note that we leave it up to each concrete sub-class of the stateful filter to 
> decide what to remember in its state and what not to. In other words, it can 
> choose to remember as much or as little from prior segments as it deems 
> necessary. In keeping with the TermsEnum interface, which the 
> StatefulTermsEnum class extends, the filter must decide which terms to accept 
> or not, based on the holistic state of the search.  
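
The docFreq bookkeeping described in the issue can be sketched in plain Java, 
without any Lucene dependency.  The class DocFreqTracker and its method names 
are hypothetical and are not taken from the attached patch; the sketch only 
shows the idea of comparing a running per-term sum against the top-level 
docFreq to recognise the last segment containing a term.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper: remembers how many documents each term has matched in
 * the segments visited so far. Not thread-safe; the issue describes keeping
 * such state on a per-thread basis.
 */
final class DocFreqTracker {
    private final Map<String, Integer> seenDocFreq = new HashMap<>();

    /**
     * Adds the term's docFreq for the current segment to the running sum.
     *
     * @return true once the sum reaches the top-level docFreq, meaning no
     *         later segment can still contain the term
     */
    boolean recordAndCheckLast(String term, int segmentDocFreq, int topLevelDocFreq) {
        int total = seenDocFreq.merge(term, segmentDocFreq, Integer::sum);
        return total >= topLevelDocFreq;
    }

    /** Drops all remembered state, e.g. once the search has concluded. */
    void clear() {
        seenDocFreq.clear();
    }

    public static void main(String[] args) {
        DocFreqTracker tracker = new DocFreqTracker();
        int topLevel = 5; // assumed docFreq of the term in the whole index
        System.out.println(tracker.recordAndCheckLast("lucene", 2, topLevel));
        System.out.println(tracker.recordAndCheckLast("lucene", 3, topLevel));
    }
}
```

A filter that wants to accept only the last matching term would consult the 
return value per segment and call clear() when the search finishes.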
