[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13395773#comment-13395773 ]
Mark Harwood commented on LUCENE-4069: -------------------------------------- Interesting results, Mike - thanks for taking the time to run them. bq. BloomFilteredFieldsProducer should just pass through intersect to the delegate? I have tried to make the BloomFilteredFieldsProducer get out of the way of the client app and the delegate PostingsFormat as soon as it is safe to do so i.e. when the user is safely focused on a non-filtered field. While there is a chance the client may end up making a call to TermsEnum.seekExact(..) on a filtered field then I need to have a wrapper object in place which is in a position to intercept this call. In all other method invocations I just end up delegating calls so I wonder if all these extra method calls are the cause of the slowdown you see e.g. when Fuzzy is enumerating over many terms. The only other alternatives to endlessly wrapping in this way are: a) API change - e.g. allow TermsEnum.seekExact to have a pluggable call-out for just this one method. b) Mess around with byte-code manipulation techniques to weave in Bloom filtering(the sort of thing I recall Hibernate resorts to) Neither of these seem particularly appealing options so I think we may have to live with fuzzy+bloom not being as fast as straight fuzzy. For completeness sake - I don't have access to your benchmarking code but I would hope that PostingsFormat.fieldsProducer() isn't called more than once for the same segment as that's where the Bloom filters get loaded from disk so there's inherent cost there too. I can't imagine this is the case. BTW I've just finished a long-running set of tests which mixes up reads and writes here: http://goo.gl/KJmGv This benchmark represents how graph databases such as Neo4j use Lucene for an index when loading (I typically use the Wikipedia links as a test set). I look to get a 3.5 x speed up in Lucene 4 and Lucene 3.6 gets nearly 9 x speedup over the comparatively slower 3.6 codebase. > Segment-level Bloom filters for a 2 x speed up on rare term searches > -------------------------------------------------------------------- > > Key: LUCENE-4069 > URL: https://issues.apache.org/jira/browse/LUCENE-4069 > Project: Lucene - Java > Issue Type: Improvement > Components: core/index > Affects Versions: 3.6, 4.0 > Reporter: Mark Harwood > Priority: Minor > Fix For: 4.0, 3.6.1 > > Attachments: BloomFilterPostingsBranch4x.patch, > MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java > > > An addition to each segment which stores a Bloom filter for selected fields > in order to give fast-fail to term searches, helping avoid wasted disk access. > Best suited for low-frequency fields e.g. primary keys on big indexes with > many segments but also speeds up general searching in my tests. > Overview slideshow here: > http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments > Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU > Patch based on 3.6 codebase attached. > There are no 3.6 API changes currently - to play just add a field with "_blm" > on the end of the name to invoke special indexing/querying capability. > Clearly a new Field or schema declaration(!) would need adding to APIs to > configure the service properly. > Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org