[ https://issues.apache.org/jira/browse/LUCENE-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935928#action_12935928 ]
Trejkaz commented on LUCENE-2506:
---------------------------------

That sounds like it would cost a fair bit of memory if you had hundreds of
millions of documents. The worst thing is that people actually load this much
all the time, on desktop computers with only a few gigs of RAM.

Because it's a desktop app, "why don't you get another gig of RAM for the
cache" probably won't fly if the requirement suddenly appears in a new release
of our software. If it were a server app, maybe that would fly... maybe.

But yeah, some variant on this which reads only part of it from disk instead
of all of it might speed things up a bit. A giant IntBuffer over a memory
mapped file would probably be cached by the OS anyway.

> A Stateful Filter That Works Across Index Segments
> --------------------------------------------------
>
>                 Key: LUCENE-2506
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2506
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 3.0.2
>            Reporter: Karthick Sankarachary
>         Attachments: LUCENE-2506.patch
>
>
> By design, Lucene's Filter abstraction is applied once for every segment in
> the index during searching. In particular, the reader provided to its
> #getDocIdSet method does not represent the whole underlying index. In other
> words, if the index has more than one segment the given reader only
> represents a single segment. As a result, that definition of the filter
> suffers the limitation of not having the ability to permit/prohibit documents
> in the search results based on the terms that reside in segments that precede
> the current one.
> To address this limitation, we introduce here a StatefulFilter which
> specifically builds on the Filter class so as to make it capable of
> remembering terms in segments spanning the whole underlying index. To
> reiterate, the need for making filters stateful stems from the fact that
> some, although not most, filters care about the terms that they may have come
> across in prior segments. It does so by keeping track of the past terms from
> prior segments in a cache that is maintained in a StatefulTermsEnum instance
> on a per-thread basis.
> Additionally, to address the case where a filter might want to accept the
> last matching term, we keep track of the TermsEnum#docFreq of the terms in
> the segments filtered thus far. By comparing the sum of such
> TermsEnum#docFreq with that of the top-level reader, we can tell if the
> current segment is the last segment in which the current term appears.
> Ideally, for this to work correctly, we require the user to explicitly set
> the top-level reader on the StatefulFilter. Knowing what the top-level reader
> is also helps the StatefulFilter to clean up after itself once the search has
> concluded.
> Note that we leave it up to each concrete sub-class of the stateful filter to
> decide what to remember in its state and what not to. In other words, it can
> choose to remember as much or as little from prior segments as it deems
> necessary. In keeping with the TermsEnum interface, which the
> StatefulTermsEnum class extends, the filter must decide which terms to accept
> or not, based on the holistic state of the search.
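
For illustration, the "giant IntBuffer over a memory mapped file" idea from
the comment above could look roughly like the sketch below. This is not taken
from LUCENE-2506.patch: the class name MappedIntCache, the helpers writeInts
and mapInts, and the file layout (one 32-bit int per document, indexed by doc
id) are all assumptions made for the example. Only standard java.nio calls
are used.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of Lucene or of the attached patch.
class MappedIntCache {

    /** Writes one big-endian int per document to the given file (assumed layout). */
    static void writeInts(Path file, int[] values) throws IOException {
        ByteBuffer bytes = ByteBuffer.allocate(values.length * Integer.BYTES);
        for (int v : values) {
            bytes.putInt(v);
        }
        bytes.flip();
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            while (bytes.hasRemaining()) {
                ch.write(bytes);
            }
        }
    }

    /**
     * Maps the file read-only and views it as ints. Pages are faulted in on
     * demand and kept in the OS page cache rather than on the Java heap.
     * A single mapping cannot exceed Integer.MAX_VALUE bytes, so a file for
     * more than about 536 million ints would need several mappings.
     */
    static IntBuffer mapInts(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asIntBuffer();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("per-doc-values", ".bin");
        writeInts(file, new int[] {0, 10, 20, 30});
        IntBuffer values = mapInts(file);
        System.out.println(values.get(3)); // prints 30
    }
}
```

Whether this is actually faster than an on-heap cache would depend on access
patterns and on how much of the file the OS keeps resident, so it would need
measuring on the kind of desktop machine described above.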