[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

Michael McCandless (JIRA) Mon, 18 Jun 2012 04:42:48 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13395830#comment-13395830
 ]


Michael McCandless commented on LUCENE-4069:
--------------------------------------------

bq. Interesting results, Mike - thanks for taking the time to run them.

You're welcome!

{quote}
bq. BloomFilteredFieldsProducer should just pass through intersect to the 
delegate?

I have tried to make the BloomFilteredFieldsProducer get out of the way of the 
client app and the delegate PostingsFormat as soon as it is safe to do so i.e. 
when the user is safely focused on a non-filtered field. While there is a 
chance the client may end up making a call to TermsEnum.seekExact(..) on a 
filtered field then I need to have a wrapper object in place which is in a 
position to intercept this call. In all other method invocations I just end up 
delegating calls so I wonder if all these extra method calls are the cause of 
the slowdown you see e.g. when Fuzzy is enumerating over many terms. 
 The only other alternatives to endlessly wrapping in this way are:
 a) API change - e.g. allow TermsEnum.seekExact to have a pluggable call-out 
for just this one method.
 b) Mess around with byte-code manipulation techniques to weave in Bloom 
filtering(the sort of thing I recall Hibernate resorts to)

Neither of these seem particularly appealing options so I think we may have to 
live with fuzzy+bloom not being as fast as straight fuzzy.
{quote}

I think the fix is simple: you are not overriding Terms.intersect now,
in BloomFilteredTerms.  I think you should override it and immediately
delegate and then FuzzyN/Respell performance should be just as good as
Lucene40 codec.

bq. For completeness sake - I don't have access to your benchmarking code

All the benchmarking code is here: 
http://code.google.com/a/apache-extras.org/p/luceneutil/

I run it nightly (trunk) and publish the results here: 
http://people.apache.org/~mikemccand/lucenebench/

bq. but I would hope that PostingsFormat.fieldsProducer() isn't called more 
than once for the same segment as that's where the Bloom filters get loaded 
from disk so there's inherent cost there too. I can't imagine this is the case.

It's only called once on init'ing the SegmentReader (or at least it
better be!).

{quote}
BTW I've just finished a long-running set of tests which mixes up reads and 
writes here: http://goo.gl/KJmGv
 This benchmark represents how graph databases such as Neo4j use Lucene for an 
index when loading (I typically use the Wikipedia links as a test set). I look 
to get a 3.5 x speed up in Lucene 4 and Lucene 3.6 gets nearly 9 x speedup over 
the comparatively slower 3.6 codebase.
{quote}

Nice results!  It looks like bloom(3.6) is faster than bloom(4.0)?
Why is that...

Also I wonder why you see such sizable (3.5X speedup) gains on PK
lookup but in my benchmark I see only ~13% - 24%.  My index has 5
segments per level...

                
> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6, 4.0
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 4.0, 3.6.1
>
>         Attachments: BloomFilterPostingsBranch4x.patch, 
> MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" 
> on the end of the name to invoke special indexing/querying capability. 
> Clearly a new Field or schema declaration(!) would need adding to APIs to 
> configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

Reply via email to