[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286707#comment-13286707 ]
Mark Harwood commented on LUCENE-4069:
--------------------------------------

bq. To solve what you speak of we just need to resolve LUCENE-4093.

Presumably the main objective here is that, in order to cut down on the number of files we store, content consumers of various types should aim to consolidate multiple fields' contents into a single file (if they share common config choices).

bq. Then multiple postings format instances that are 'the same' will be deduplicated correctly.

The complication in this case is that we essentially have two consumers (Bloom and Lucene40), one wrapped in the other, with different but overlapping choices of fields: e.g. we want a single Lucene40 to process all fields, but we want Bloom to handle only a subset of them. This will be a tough one for PerFieldPostingsFormat to untangle while we are stuck with a delegating model for composing consumers.

This may be made easier if, instead of delegating a single stream, we have a *stream-splitting* capability via a multicast subscription: e.g. the Bloom filtering consumer registers interest in content streams for fields A and B, while Lucene40 consolidates content from fields A, B, C and D. A broadcast mechanism feeds each consumer a copy of the relevant stream, and each consumer is responsible for inventing its own file-naming convention that avoids muddling files.

While that may help for writing streams, it doesn't solve the re-assembly of "producer" streams at read time, where BloomFilter absolutely has to position itself in front of the standard Lucene40 producer in order to offer fast-fail lookups. In the absence of a fancy optimised routing mechanism (this all may be overkill), my current solution was to put BloomFilter in the delegate chain, armed with a subset of field names to observe as a larger array of fields flows past to a common delegate. I added some Javadocs to describe the need to do it this way for an efficient configuration.
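The delegate-chain arrangement above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual Lucene API: the class and method names (BloomFilteringTermSource, observe, mightContain) and the single-hash BitSet filter are all hypothetical stand-ins, with a plain Predicate standing in for the common Lucene40 delegate.

```java
import java.util.BitSet;
import java.util.Set;
import java.util.function.Predicate;

// Illustrative stand-in for a Bloom-filtering wrapper sitting in front of a
// common delegate. Only the configured subset of fields is bloom-filtered;
// all lookups still fall through to the delegate on a possible hit.
class BloomFilteringTermSource {
    private final Set<String> bloomFields;    // subset of fields to observe
    private final BitSet bloom = new BitSet(1 << 16);
    private final Predicate<String> delegate; // common delegate seeing all fields

    BloomFilteringTermSource(Set<String> bloomFields, Predicate<String> delegate) {
        this.bloomFields = bloomFields;
        this.delegate = delegate;
    }

    // Write path: record terms only for the observed fields as the larger
    // stream of fields flows past to the common delegate.
    void observe(String field, String term) {
        if (bloomFields.contains(field)) {
            bloom.set(hash(field, term));
        }
    }

    // Read path: fast-fail before any delegate (i.e. disk) access.
    boolean mightContain(String field, String term) {
        if (bloomFields.contains(field) && !bloom.get(hash(field, term))) {
            return false; // definite miss: skip the delegate entirely
        }
        return delegate.test(field + ":" + term);
    }

    private int hash(String field, String term) {
        return Math.floorMod((field + ":" + term).hashCode(), bloom.size());
    }
}
```

The point of the sketch is the ordering constraint from the comment: the Bloom layer must sit in front of the delegate at read time so a definite miss never reaches it, while at write time it merely observes its subset of the fields flowing to the shared consumer.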
You are right that this is messy (i.e. open to bad configuration), but operating this deep down in Lucene, that's always a possibility regardless of what we put in place.

> Segment-level Bloom filters for a 2x speed up on rare term searches
> -------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6, 4.0
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 4.0, 3.6.1
>
>         Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields, e.g. primary keys on big indexes with many segments, but it also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play, just add a field with "_blm" on the end of the name to invoke the special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to the APIs to configure the service properly.
> Also, a patch for the Lucene 4.0 codebase introducing a new PostingsFormat