[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415604#comment-13415604 ]
Michael McCandless commented on LUCENE-4069:
--------------------------------------------

{quote}
bq. It's the unique term count (for this one segment) that you need right?
Yes, I need it before I start processing the stream of terms being flushed.
{quote}

At a minimum I think before committing we should make the SegmentWriteState accessible. E.g., then at least I can use numDocs for my primary key field.

I really don't like how easy it is to silently mis-configure this PF: the default 8 MB is way too high for an NRT setting and way too low for a large index.

bq. Currently all PostingFormat impls that extend BloomFilterPostingsFormat can be anonymous (i.e. unregistered via SPI).

Hmm, why is anonymity at search time important? I think that's a non-feature, and we shouldn't make our core code more complex for it. I.e., it's fine to require the app to make a named PF (that is accessible via SPI), implementing all their custom bloom logic (which fields are bloom'd, what hash to use, etc.). When an app makes a custom Codec/PostingsFormat, it's expected that that class is accessible via SPI at both index time and search time.

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6, 4.0-ALPHA
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: BloomFilterPostingsBranch4x.patch,
> LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch,
> MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java,
> PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java,
> PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
>
>
> An addition to each segment which stores a Bloom filter for selected fields
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with
> many segments but also speeds up general searching in my tests.
> Overview slideshow here:
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm"
> on the end of the name to invoke special indexing/querying capability.
> Clearly a new Field or schema declaration(!) would need adding to APIs to
> configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat
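
To illustrate the sizing concern raised in the comment (a fixed 8 MB default versus a filter sized from the segment's numDocs), here is a minimal Java sketch using the standard Bloom filter formulas. It is not code from the patch: the class name BloomSizingSketch, its method names, the 1% target false-positive rate, and the example segment sizes are all hypothetical, and reading "8 MB" as 8 * 1024 * 1024 bytes is an assumption.

```java
// Hypothetical sketch, not part of LUCENE-4069: compares a fixed-size Bloom
// filter against one sized from the expected number of unique keys
// (e.g. numDocs for a primary key field).
final class BloomSizingSketch {

    // Assumption: "8 MB" in the comment means 8 * 1024 * 1024 bytes.
    static final long FIXED_BITS = 8L * 1024 * 1024 * 8;

    /** Standard sizing formula m = -n * ln(p) / (ln 2)^2, assuming an optimal hash count. */
    static long optimalBits(long expectedKeys, double targetFpp) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-expectedKeys * Math.log(targetFpp) / (ln2 * ln2));
    }

    /** Hash count minimizing false positives, k = (m / n) * ln 2, clamped to at least 1. */
    static int optimalHashes(long bits, long keys) {
        return (int) Math.max(1, Math.round((double) bits / keys * Math.log(2)));
    }

    /** Expected false-positive rate (1 - e^(-k*n/m))^k for m bits, n keys, k hashes. */
    static double falsePositiveRate(long bits, long keys, int hashes) {
        return Math.pow(1.0 - Math.exp(-(double) hashes * keys / bits), hashes);
    }

    public static void main(String[] args) {
        // Hypothetical segment sizes: a small NRT segment up to a large merged one.
        long[] segmentSizes = {10_000L, 1_000_000L, 100_000_000L};
        for (long numDocs : segmentSizes) {
            long neededBits = optimalBits(numDocs, 0.01);
            int k = optimalHashes(FIXED_BITS, numDocs);
            double fppAtFixed = falsePositiveRate(FIXED_BITS, numDocs, k);
            System.out.printf("numDocs=%d neededBytes=%d fppWithFixedSize=%.4f%n",
                    numDocs, neededBits / 8, fppAtFixed);
        }
    }
}
```

Under these assumptions, a small segment needs far less than the fixed size for a 1% false-positive rate, while a segment with 100 million unique keys needs far more. That is the mis-configuration risk the comment describes, and it is why exposing SegmentWriteState (and hence numDocs) to the PostingsFormat would help.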