[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

Mark Harwood (JIRA) Mon, 18 Jun 2012 02:49:52 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13395773#comment-13395773
 ]


Mark Harwood commented on LUCENE-4069:
--------------------------------------

Interesting results, Mike - thanks for taking the time to run them.

bq.  BloomFilteredFieldsProducer should just pass through intersect to the 
delegate?

I have tried to make the BloomFilteredFieldsProducer get out of the way of the 
client app and the delegate PostingsFormat as soon as it is safe to do so, i.e. 
when the user is safely focused on a non-filtered field. As long as there is a 
chance that the client may call TermsEnum.seekExact(..) on a filtered field, I 
need a wrapper object in place that is in a position to intercept this call. 
For all other method invocations I just delegate, so I wonder if all these 
extra method calls are the cause of the slowdown you see, e.g. when Fuzzy is 
enumerating over many terms. 
The only other alternatives to endlessly wrapping in this way are:
a) API change - e.g. allow TermsEnum.seekExact to have a pluggable call-out for 
just this one method.
b) Mess around with byte-code manipulation techniques to weave in Bloom 
filtering (the sort of thing I recall Hibernate resorts to)

Neither of these seems a particularly appealing option, so I think we may have 
to live with fuzzy+bloom not being as fast as straight fuzzy.
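
A minimal, self-contained sketch of the wrapping pattern described above. It is 
not the code from the patch: SimpleTermsEnum, ToyBloom, ListTermsEnum and 
BloomFilteredTermsEnum are hypothetical stand-ins invented for illustration, 
and the two-probe hash is an arbitrary choice. The point is only that 
seekExact is the one call that gets intercepted, while every other call (here 
just next()) costs one extra hop through the wrapper.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;

final class BloomWrapperSketch {

    /** Hypothetical, cut-down terms enumerator; NOT Lucene's TermsEnum. */
    interface SimpleTermsEnum {
        boolean seekExact(String term);
        String next(); // returns null when exhausted
    }

    /** Toy Bloom filter: two hash probes into a fixed-size bit set (assumed design). */
    static final class ToyBloom {
        private final BitSet bits;
        private final int size;

        ToyBloom(int size) {
            this.size = size;
            this.bits = new BitSet(size);
        }

        private int h1(String s) { return Math.floorMod(s.hashCode(), size); }
        private int h2(String s) { return Math.floorMod(s.hashCode() * 31 + 17, size); }

        void add(String s) { bits.set(h1(s)); bits.set(h2(s)); }

        /** false means "definitely absent"; true means "possibly present". */
        boolean mightContain(String s) { return bits.get(h1(s)) && bits.get(h2(s)); }
    }

    /** Stand-in for the delegate; counts how often the "expensive" seek is reached. */
    static final class ListTermsEnum implements SimpleTermsEnum {
        private final List<String> sortedTerms;
        private int pos = -1;
        int seeks = 0;

        ListTermsEnum(List<String> sortedTerms) { this.sortedTerms = sortedTerms; }

        @Override public boolean seekExact(String term) {
            seeks++;
            int i = Collections.binarySearch(sortedTerms, term);
            if (i >= 0) { pos = i; return true; }
            return false;
        }

        @Override public String next() {
            pos++;
            return pos < sortedTerms.size() ? sortedTerms.get(pos) : null;
        }
    }

    /** The wrapper: intercepts seekExact only, delegates everything else. */
    static final class BloomFilteredTermsEnum implements SimpleTermsEnum {
        private final SimpleTermsEnum delegate;
        private final ToyBloom bloom;

        BloomFilteredTermsEnum(SimpleTermsEnum delegate, ToyBloom bloom) {
            this.delegate = delegate;
            this.bloom = bloom;
        }

        @Override public boolean seekExact(String term) {
            if (!bloom.mightContain(term)) {
                return false; // fast-fail: the delegate is never touched
            }
            return delegate.seekExact(term); // possible hit, so the delegate decides
        }

        @Override public String next() {
            return delegate.next(); // pure pass-through, one extra call per term
        }
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("apple", "banana", "cherry");
        ToyBloom bloom = new ToyBloom(1024);
        for (String t : terms) {
            bloom.add(t);
        }
        ListTermsEnum delegate = new ListTermsEnum(terms);
        SimpleTermsEnum wrapped = new BloomFilteredTermsEnum(delegate, bloom);

        System.out.println("banana found: " + wrapped.seekExact("banana"));
        System.out.println("zebra found:  " + wrapped.seekExact("zebra"));
        System.out.println("delegate seeks: " + delegate.seeks);
    }
}
```

In this sketch an enumeration-heavy caller pays the extra next() hop on every 
term and never benefits from the filter, which is the shape of overhead 
suspected in the fuzzy case.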

For completeness' sake: I don't have access to your benchmarking code, but I 
would hope that PostingsFormat.fieldsProducer() isn't called more than once for 
the same segment, as that's where the Bloom filters get loaded from disk, so 
there's an inherent cost there too. I can't imagine this is the case.

BTW I've just finished a long-running set of tests which mixes reads and 
writes; the results are here: http://goo.gl/KJmGv
This benchmark represents how graph databases such as Neo4j use Lucene as an 
index when loading data (I typically use the Wikipedia links as a test set). 
With Bloom filtering I see about a 3.5 x speed-up on Lucene 4, and nearly a 
9 x speed-up on the comparatively slower 3.6 codebase.

                
> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6, 4.0
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 4.0, 3.6.1
>
>         Attachments: BloomFilterPostingsBranch4x.patch, 
> MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" 
> on the end of the name to invoke special indexing/querying capability. 
> Clearly a new Field or schema declaration(!) would need adding to APIs to 
> configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat

